Abstract



We consider the matrix completion problem under a form of row/column weighted 
entrywise sampling, including the case of uniform entrywise sampling as a special case. 
We analyze the associated random observation operator, and prove that with high probability,
it satisfies a form of restricted strong convexity with respect to weighted Frobenius
norm. Using this property, we obtain as corollaries a number of error bounds on matrix 
completion in the weighted Frobenius norm under noisy sampling and for both exact and 
near low-rank matrices. Our results are based on measures of the "spikiness" and "low- 
rankness" of matrices that are less restrictive than the incoherence conditions imposed 
in previous work. Our technique involves an M-estimator that includes controls on both 
the rank and spikiness of the solution, and we establish non-asymptotic error bounds in 
weighted Frobenius norm for recovering matrices lying within $\ell_q$-"balls" of bounded spikiness.
Using information-theoretic methods, we show that no algorithm can achieve better
estimates (up to a logarithmic factor) over these same sets, showing that our conditions 
on matrices and associated rates are essentially optimal. 



1 Introduction

Matrix completion problems correspond to reconstructing matrices, either exactly or approx- 
imately, based on observing a subset of their entries \1'6\ [5] . In the simplest formulation of 
matrix completion, the observations are assumed to be uncorrupted, whereas a more general 
formulation (as considered in this paper) allows for noisiness in these observations. Matrix
recovery based on only partial information is an ill-posed problem, and accurate estimates are 
possible only if the matrix satisfies additional structural constraints, with examples including
bandedness, positive semidefiniteness, Euclidean distance measurements, Toeplitz, and
low-rank structure (see the survey paper [13] and references therein for more background).

The focus of this paper is low-rank matrix completion based on noisy observations. This 
problem is motivated by a variety of applications where an underlying matrix is likely to 
have low-rank, or near low-rank structure. The archetypal example is the Netflix challenge, 
a version of the collaborative filtering problem, in which the unknown matrix is indexed 
by individuals and movies, and each observed entry of the matrix corresponds to the rating 
assigned to the associated movie by the given individual. Since the typical person only watches 
a tiny number of movies (compared to the total Netflix database), it is only a sparse subset of 
matrix entries that are observed. In this context, one goal of collaborative filtering is to use 
the observed entries to make recommendations to a person regarding movies that they have 
not yet seen. We refer the reader to Srebro's thesis [26] (and references therein) for further 
discussion and motivation for collaborative filtering and related problems. 

In this paper, we analyze a method for approximate low-rank matrix recovery using an
M-estimator that is a combination of a data term and a weighted nuclear norm as a regularizer.
The nuclear norm is the sum of the singular values of a matrix [10], and has been
studied in a body of past work, both on matrix completion and more general problems of 
low-rank matrix estimation (e.g., [91 [26l [271 ESI [231 El O [22l HH H21 HH1 El]). A parallel line 
of work has studied computationally efficient algorithms for solving problems with nuclear 
norm constraints (e.g, [TTlEQlIIB]). Here we limit our detailed discussion to those papers that 
study various aspects of the matrix completion problem. Motivated by various problems in 
collaborative filtering, Srebro and colleagues [261 123 EH] studied various aspects nuclear norm 
regularization; among various other contributions, Srebro et al. [27] established generalization 
error bounds under certain conditions. Candes and Recht [5] studied the exact reconstruc- 
tion of a low-rank matrix given perfect (noiseless) observations of a subset of entries, and 
provided sufficient conditions for exact recovery via nuclear norm relaxation. These results 
were then refined in follow-up work [6, 22], with the simplest approach to date being provided
by Recht [22]. In a parallel line of work, Keshavan et al. [12] have studied a method
based on thresholding and singular value decomposition, and established various results on 
its behavior, both for noiseless and noisy matrix completion. Among other results, Rohde and 
Tsybakov [24] establish prediction error bounds for matrix completion, a different metric than 
the matrix recovery problem of interest here. In recent work, Salakhutdinov and Srebro [25]
provided various motivations for the use of weighted nuclear norms, in particular showing 
that the standard nuclear norm relaxation can behave very poorly when the sampling is non- 
uniform. The analysis of this paper applies to both uniform and non-uniform sampling, as 
well as a form of reweighted nuclear norm as suggested by these authors, one which includes 
the ordinary nuclear norm as a special case. We provide a more detailed comparison between 
our results and some aspects of past work in Section 3.4.

As has been noted before [4], a significant theoretical challenge is that conditions that 
have proven very useful for sparse linear regression — among them the restricted isometry 
property — are not satisfied for the matrix completion problem. For this reason, it is natural 
to seek an alternative and less restrictive property that might be satisfied in the matrix com- 
pletion setting. In recent work, Negahban et al. [18] have isolated a weaker condition known 
as restricted strong convexity (RSC), and proven that certain statistical models satisfy RSC 
with high probability when the associated regularizer satisfies a decomposability condition. 
When an M-estimator satisfies the RSC condition, it is relatively straightforward to derive 
non-asymptotic error bounds on parameter estimates [18]. The class of decomposable regularizers
includes the nuclear norm as a particular case, and the RSC/decomposability approach
has been exploited to derive bounds for various matrix estimation problems, among them
multi-task learning, autoregressive system identification, and compressed sensing [19].

To date, however, an open question is whether or not an appropriate form of RSC holds 
for the matrix completion problem. If it did hold, then it would be possible to derive non- 
asymptotic error bounds (in Frobenius norm) for matrix completion based on noisy observa- 
tions. Within this context, the main contribution of this paper is to prove that with high 
probability, a form of the RSC condition holds for the matrix completion problem, in particular
over an interesting set of matrices $\mathfrak{C}$, as defined in equation (8) to follow, that have both
low nuclear/Frobenius norm ratio and low "spikiness". The set $\mathfrak{C}$ also excludes a neighborhood
around zero, which is essential so as to eliminate the nullspace of the sampling operator
underlying matrix completion. Exploiting this RSC condition then allows us to derive non- 
asymptotic error bounds on matrix recovery in weighted Frobenius norms, both for exactly 
and approximately low-rank matrices. The theoretical core of this paper consists of three 
main results. Our first result (Theorem 1) proves that the matrix completion loss function
satisfies restricted strong convexity with high probability over the set $\mathfrak{C}$. Our second result
(Theorem 2) exploits this fact to derive a non-asymptotic error bound for matrix recovery in
the weighted Frobenius norm, one applicable to general matrices. We then specialize this re- 
sult to the problem of estimating exactly low-rank matrices (with a small number of non-zero 
singular values), as well as near low-rank matrices characterized by relatively swift decay of 
their singular values. To the best of our knowledge, our results on near low-rank matrices 
are the first for approximate matrix recovery in the noisy setting, and as we discuss at more 
length in Section 3.4, our results on the exactly low-rank case are sharper than past work on
the problem. Indeed, our final result (Theorem 3) uses information-theoretic techniques to
establish that up to logarithmic factors, no algorithm can obtain faster rates than our method
over the $\ell_q$-balls of matrices with bounded spikiness treated in this paper.

The remainder of this paper is organized as follows. We begin in Section 2 with background
and a precise formulation of the problem. Section 3 is devoted to a statement of our main
results, and discussion of some of their consequences. In Sections 4 and 5, we prove
our main results, with more technical aspects of the arguments deferred to appendices. We
conclude with a discussion in Section 6.



2 Background and problem formulation 

In this section, we introduce background on the low-rank matrix completion problem, and also
provide a precise statement of the problem studied in this paper. 

2.1 Uniform and weighted sampling models 

Let $\Theta^* \in \mathbb{R}^{d_r \times d_c}$ be an unknown matrix, and consider an observation model in which we make
$n$ i.i.d. observations of the form

$$\tilde{y}_i = \Theta^*_{j(i) k(i)} + \frac{\nu}{\sqrt{d_r d_c}}\, \tilde{\xi}_i. \qquad (1)$$

Here the quantities $\frac{\nu}{\sqrt{d_r d_c}} \tilde{\xi}_i$ correspond to additive observation noises with variance
appropriately scaled according to the matrix dimensions. In defining the observation model, one
can either allow the Frobenius norm of $\Theta^*$ to grow with the dimension, as is done in other
work [4, 12], or rescale the noise as we have done here. This choice is consistent with our
assumption that $\Theta^*$ has constant Frobenius norm regardless of its rank or dimensions. With
this scaling, each observation in the model (1) has a constant signal-to-noise ratio regardless
of matrix dimensions.

In the simplest model, the row $j(i)$ and column $k(i)$ indices are chosen uniformly at
random from the sets $\{1, 2, \ldots, d_r\}$ and $\{1, 2, \ldots, d_c\}$ respectively. In this paper, we consider
a somewhat more general weighted sampling model. In particular, let $R \in \mathbb{R}^{d_r \times d_r}$
and $C \in \mathbb{R}^{d_c \times d_c}$ be diagonal matrices, with rescaled diagonals $\{R_j/d_r,\ j = 1, 2, \ldots, d_r\}$ and
$\{C_k/d_c,\ k = 1, 2, \ldots, d_c\}$ representing probability distributions over the rows and columns
of a $d_r \times d_c$ matrix. We consider the weighted sampling model in which we make a noisy
observation of entry $(j, k)$ with probability $R_j C_k/(d_r d_c)$, meaning that the row index $j(i)$
(respectively column index $k(i)$) is chosen according to the probability distribution $R/d_r$
(respectively $C/d_c$). Note that in the special case that $R = I_{d_r}$ and $C = I_{d_c}$, the observation
model (1) reduces to the usual model of uniform sampling.

We assume that each row and column is sampled with positive probability, in particular
that there is some constant $1 \le L < \infty$ such that $R_a \ge 1/L$ and $C_b \ge 1/L$ for all rows and
columns. However, apart from the constraints $\sum_{a=1}^{d_r} R_a = d_r$ and $\sum_{b=1}^{d_c} C_b = d_c$, we do not
require that the row and column weights remain bounded as $d_r$ and $d_c$ tend to infinity.
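
As an illustration of the weighted sampling model, the following minimal sketch draws noisy entries of a matrix with row and column indices sampled from the distributions $R/d_r$ and $C/d_c$. The function name `sample_weighted_entries`, the Gaussian noise, and the particular weights are assumptions made for the example; they are not part of the analysis.

```python
import numpy as np

def sample_weighted_entries(theta, row_w, col_w, n, nu, rng):
    """Draw n noisy entries of theta under the weighted sampling model.

    row_w and col_w hold the diagonals of R and C (positive, summing to
    d_r and d_c). Gaussian noise is an assumption of this sketch.
    """
    d_r, d_c = theta.shape
    rows = rng.choice(d_r, size=n, p=row_w / d_r)  # row index ~ R / d_r
    cols = rng.choice(d_c, size=n, p=col_w / d_c)  # column index ~ C / d_c
    noise = nu / np.sqrt(d_r * d_c) * rng.standard_normal(n)
    return rows, cols, theta[rows, cols] + noise

rng = np.random.default_rng(0)
theta = np.ones((4, 6)) / np.sqrt(24)    # unit Frobenius norm
row_w = np.array([2.0, 1.0, 0.5, 0.5])   # sums to d_r = 4
col_w = np.ones(6)                       # uniform over columns
rows, cols, y = sample_weighted_entries(theta, row_w, col_w, n=50, nu=0.0, rng=rng)
```

Setting all weights to one recovers uniform sampling; with `nu=0.0` the observations are exact entries of the matrix.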

2.2 The observation operator and restricted strong convexity 

We now describe an alternative formulation of the observation model (1) that, while
statistically equivalent to the original, turns out to be more natural for analysis. For each
$i = 1, 2, \ldots, n$, define the matrix

$$X^{(i)} = \sqrt{d_r d_c}\; \varepsilon_i\, e_{a(i)} e_{b(i)}^T, \qquad (2)$$

where $a(i)$ and $b(i)$ are the sampled row and column indices and $\varepsilon_i \in \{-1, +1\}$ is a random sign, and consider the observation model

$$y_i = \langle\langle X^{(i)}, \Theta^* \rangle\rangle + \nu\, \xi_i, \qquad \text{for } i = 1, \ldots, n, \qquad (3)$$

where $\langle\langle A, B \rangle\rangle := \sum_{j,k} A_{jk} B_{jk}$ is the trace inner product, and $\xi_i$ is an additive noise from
the same distribution as the original model. The model (3) can be obtained from the
original model (1) by rescaling all terms by the factor $\sqrt{d_r d_c}$, and introducing the random
signs $\varepsilon_i$. The rescaling has no statistical effect, and nor do the random signs, since the noise is
symmetric (so that $\xi_i = \varepsilon_i \tilde{\xi}_i$ has the same distribution as $\tilde{\xi}_i$). Thus, the observation model (3)
is statistically equivalent to the original one (1).

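In order to specify a vector form of the observation model, let us define an operator
$\mathfrak{X}_n : \mathbb{R}^{d_r \times d_c} \to \mathbb{R}^n$ via

$$[\mathfrak{X}_n(\Theta)]_i := \langle\langle X^{(i)}, \Theta \rangle\rangle, \qquad \text{for } i = 1, 2, \ldots, n.$$

We refer to $\mathfrak{X}_n$ as the observation operator, since it maps any matrix $\Theta \in \mathbb{R}^{d_r \times d_c}$ to an
$n$-vector of samples. With this notation, we can write the observations (3) in a vectorized form
as $y = \mathfrak{X}_n(\Theta^*) + \nu\, \xi$.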

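The reformulation (3) is convenient for various reasons. For any matrix $\Theta \in \mathbb{R}^{d_r \times d_c}$, we
have $E[\langle\langle X^{(i)}, \Theta \rangle\rangle] = 0$ and

$$E[\langle\langle X^{(i)}, \Theta \rangle\rangle^2] = \sum_{j=1}^{d_r} \sum_{k=1}^{d_c} R_j\, \Theta_{jk}^2\, C_k = |||\sqrt{R}\, \Theta \sqrt{C}|||_F^2 =: |||\Theta|||_{\omega(F)}^2, \qquad (4)$$

where we have defined the weighted Frobenius norm $|||\cdot|||_{\omega(F)}$ in terms of the row $R$ and column
$C$ weights. As a consequence, the signal-to-noise ratio in the observation model (3) is given
by the ratio $\mathrm{SNR} = |||\Theta^*|||_{\omega(F)}^2 / \nu^2$.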
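
The second-moment identity for the sampling matrices can be checked by direct enumeration over all entries. The sketch below is illustrative only: the helper names and the randomly chosen test matrix are assumptions. It compares the exact expectation of the squared inner product with the squared weighted Frobenius norm.

```python
import numpy as np

def second_moment(theta, row_w, col_w):
    """Exact E[<<X, theta>>^2], summing over all entries (j, k)."""
    d_r, d_c = theta.shape
    prob = np.outer(row_w, col_w) / (d_r * d_c)   # P(entry (j, k) is sampled)
    # <<X, theta>>^2 = d_r * d_c * theta[j, k]^2 when entry (j, k) is drawn
    return np.sum(prob * (d_r * d_c) * theta**2)

def weighted_frobenius_sq(theta, row_w, col_w):
    """Squared Frobenius norm of sqrt(R) theta sqrt(C) for diagonal R, C."""
    scaled = np.sqrt(row_w)[:, None] * theta * np.sqrt(col_w)[None, :]
    return np.sum(scaled**2)

rng = np.random.default_rng(1)
theta = rng.standard_normal((3, 5))
row_w = np.array([1.5, 1.0, 0.5])             # sums to d_r = 3
col_w = np.array([2.0, 1.0, 1.0, 0.5, 0.5])   # sums to d_c = 5
lhs = second_moment(theta, row_w, col_w)
rhs = weighted_frobenius_sq(theta, row_w, col_w)
```

The two quantities agree up to floating-point error, since the factor $d_r d_c$ from the scaling of the sampling matrix cancels the normalization of the sampling probabilities.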

As shown by Negahban et al. [18], a key ingredient in establishing error bounds for the
observation model (3) is obtaining lower bounds on the restricted curvature of the sampling
operator; in particular, to establish the existence of a constant $c > 0$, which may be arbitrarily
small as long as it is positive, such that

$$\frac{\|\mathfrak{X}_n(\Theta)\|_2}{\sqrt{n}} \ge c\, |||\Theta|||_{\omega(F)}. \qquad (5)$$

For sample sizes of interest for matrix completion ($n \ll d_r d_c$), one cannot expect such
a bound to hold uniformly over all matrices $\Theta \in \mathbb{R}^{d_r \times d_c}$, even when rank constraints are
imposed. Indeed, as noted by Candes and Plan [3], the condition (5) is violated with high
probability by the rank one matrix $\Theta^*$ such that $\Theta^*_{11} = 1$ with all other entries zero. Indeed,
for a sample size $n \ll d_r d_c$, we have a vanishing probability of observing the entry $\Theta^*_{11}$, so
that $\mathfrak{X}_n(\Theta^*) = 0$ with high probability.
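
To make the counterexample concrete under uniform sampling: each of the $n$ draws misses a fixed entry with probability $1 - 1/(d_r d_c)$, so the entry is never observed with probability $(1 - 1/(d_r d_c))^n$. The short calculation below uses illustrative sizes that are assumptions, not values from the paper.

```python
def prob_entry_never_observed(n, d_r, d_c):
    """Probability that n uniform draws all miss one fixed entry."""
    return (1.0 - 1.0 / (d_r * d_c)) ** n

# With a million entries and ten thousand samples, the single non-zero
# entry of the spiky matrix is missed in roughly 99% of experiments.
p = prob_entry_never_observed(n=10_000, d_r=1000, d_c=1000)
```

In every such experiment the observation vector of the spiky matrix is identically zero, so no positive constant can satisfy the lower bound for that matrix.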

2.3 Controlling the spikiness and rank 

Intuitively, one must exclude matrices that are overly "spiky" in order to avoid the
phenomenon just described. Past work has relied on fairly restrictive matrix incoherence
conditions (see Section 3.4 for more discussion), based on specific conditions on singular vectors of
the unknown matrix $\Theta^*$. In this paper, we formalize the notion of "spikiness" in a natural
and less restrictive way, namely by comparing a weighted form of $\ell_\infty$-norm to the weighted
Frobenius norm. In particular, for any non-zero matrix $\Theta$, let us define the weighted spikiness ratio

$$\alpha_{sp}(\Theta) := \sqrt{d_r d_c}\; \frac{\|\Theta\|_{\omega(\infty)}}{|||\Theta|||_{\omega(F)}}, \qquad (6)$$

where $\|\Theta\|_{\omega(\infty)} := \|\sqrt{R}\, \Theta \sqrt{C}\|_\infty$ is the weighted elementwise $\ell_\infty$-norm. Note that this ratio
is invariant to the scaling of $\Theta$, and satisfies the inequalities $1 \le \alpha_{sp}(\Theta) \le \sqrt{d_r d_c}$. We have
$\alpha_{sp}(\Theta) = 1$ for any non-zero matrix whose entries are all equal, whereas the opposite extreme
$\alpha_{sp}(\Theta) = \sqrt{d_r d_c}$ is achieved by the "maximally spiky" matrix that is zero everywhere except
for a single position.

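In order to provide a tractable measure of how close $\Theta$ is to a low-rank matrix, we define
(for any non-zero matrix) the ratio

$$\beta_{ra}(\Theta) := \frac{|||\Theta|||_{\omega(1)}}{|||\Theta|||_{\omega(F)}}, \qquad (7)$$

where $|||\Theta|||_{\omega(1)} := |||\sqrt{R}\, \Theta \sqrt{C}|||_1$ is the weighted nuclear norm, and
which satisfies the inequalities $1 \le \beta_{ra}(\Theta) \le \sqrt{\min\{d_r, d_c\}}$. By definition of the (weighted)
nuclear and Frobenius norms, note that $\beta_{ra}(\Theta)$ is simply the ratio of the $\ell_1$ to $\ell_2$ norms of the
singular values of the weighted matrix $\sqrt{R}\, \Theta \sqrt{C}$. This measure can also be upper bounded
by the rank of $\Theta$: indeed, since $R$ and $C$ are full-rank, we always have

$$\beta_{ra}^2(\Theta) \le \mathrm{rank}(\sqrt{R}\, \Theta \sqrt{C}) = \mathrm{rank}(\Theta),$$

with equality holding if all the non-zero singular values of $\sqrt{R}\, \Theta \sqrt{C}$ are identical.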
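
Both ratios are straightforward to compute numerically. The following sketch (function names are assumptions chosen for illustration) evaluates the weighted spikiness ratio and the nuclear/Frobenius ratio, and reproduces the two extreme cases discussed above under uniform weights.

```python
import numpy as np

def weighted(theta, row_w, col_w):
    """Return sqrt(R) theta sqrt(C) for diagonal weights R, C."""
    return np.sqrt(row_w)[:, None] * theta * np.sqrt(col_w)[None, :]

def spikiness_ratio(theta, row_w, col_w):
    """sqrt(d_r d_c) * (weighted max-abs entry) / (weighted Frobenius norm)."""
    w = weighted(theta, row_w, col_w)
    d_r, d_c = theta.shape
    return np.sqrt(d_r * d_c) * np.abs(w).max() / np.linalg.norm(w, "fro")

def rank_ratio(theta, row_w, col_w):
    """Ratio of the l1 to l2 norms of the weighted singular values."""
    s = np.linalg.svd(weighted(theta, row_w, col_w), compute_uv=False)
    return s.sum() / np.sqrt((s**2).sum())

d_r, d_c = 4, 5
ones_r, ones_c = np.ones(d_r), np.ones(d_c)   # uniform weights
flat = np.ones((d_r, d_c))                    # all entries equal
spike = np.zeros((d_r, d_c))
spike[0, 0] = 1.0                             # single non-zero entry
```

Here `spikiness_ratio(flat, ones_r, ones_c)` equals one, while `spikiness_ratio(spike, ones_r, ones_c)` equals $\sqrt{d_r d_c}$; both matrices have rank one, so their rank ratio is one.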



3 Main results and their consequences 

We now turn to the statement of our main results, and discussion of their consequences. 
Section 3.1 is devoted to a result showing that a suitable form of restricted strong convexity
holds for the random sampling operator $\mathfrak{X}_n$, as long as we restrict it to matrices $\Delta$ for which
$\beta_{ra}(\Delta)$ and $\alpha_{sp}(\Delta)$ are not "overly large". In Section 3.2, we develop the consequences of
the RSC condition for noisy matrix completion, and in Section 3.3, we prove that our error
bounds are minimax-optimal up to logarithmic factors. In Section 3.4, we provide a detailed
comparison of our results with past work.

3.1 Restricted strong convexity for matrix sampling

Introducing the convenient shorthand $d = \frac{1}{2}(d_r + d_c)$, let us define the constraint set

$$\mathfrak{C}(n; c_0) := \Big\{ \Delta \in \mathbb{R}^{d_r \times d_c},\ \Delta \neq 0 \;\Big|\; \alpha_{sp}(\Delta)\, \beta_{ra}(\Delta) \le \frac{1}{c_0 L} \sqrt{\frac{n}{d \log d}} \Big\}, \qquad (8)$$

where $c_0$ is a universal constant. Note that as the sample size $n$ increases, this set allows
for matrices with larger values of the spikiness and/or rank measures, $\alpha_{sp}(\Delta)$ and $\beta_{ra}(\Delta)$
respectively.

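Theorem 1. There are universal constants $(c_0, c_1, c_2, c_3)$ such that as long as $n > c_3\, d \log d$,
we have

$$\frac{\|\mathfrak{X}_n(\Delta)\|_2}{\sqrt{n}} \ge \frac{1}{8}\, |||\Delta|||_{\omega(F)} \Big\{ 1 - \frac{128\, L\, \alpha_{sp}(\Delta)}{\sqrt{n}} \Big\} \qquad \text{for all } \Delta \in \mathfrak{C}(n; c_0) \qquad (9)$$

with probability greater than $1 - c_1 \exp(-c_2\, d \log d)$.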



Roughly speaking, this bound guarantees that the observation operator captures a substantial
component of any matrix $\Delta \in \mathfrak{C}(n; c_0)$ that is not overly spiky. More precisely, as
long as $\frac{128\, L\, \alpha_{sp}(\Delta)}{\sqrt{n}} \le \frac{1}{2}$, the bound (9) implies that

$$\frac{\|\mathfrak{X}_n(\Delta)\|_2}{\sqrt{n}} \ge \frac{1}{16}\, |||\Delta|||_{\omega(F)} \qquad \text{for any } \Delta \in \mathfrak{C}(n; c_0). \qquad (10)$$

This bound can be interpreted in terms of restricted strong convexity [18]. In particular, given
a vector $y \in \mathbb{R}^n$ of noisy observations, consider the quadratic loss function

$$\mathcal{L}(\Theta; y) = \frac{1}{2n} \|y - \mathfrak{X}_n(\Theta)\|_2^2.$$

Since the Hessian matrix of this function is given by $\mathfrak{X}_n^* \mathfrak{X}_n / n$, the bound (10) implies that
the quadratic loss is strongly convex in a restricted set of directions $\Delta$.

As discussed previously, the worst-case value of the "spikiness" measure is $\alpha_{sp}(\Delta) = \sqrt{d_r d_c}$,
achieved for a matrix that is zero everywhere except a single position. In this most degenerate
of cases, the combination of the constraint $\frac{\alpha_{sp}(\Delta)}{\sqrt{n}} < 1$ and the membership condition
$\Delta \in \mathfrak{C}(n; c_0)$ implies that even for a rank one matrix (so that $\beta_{ra}(\Delta) = 1$), we need sample size
$n > d^2$ for Theorem 1 to provide a non-trivial result, as is to be expected.



3.2 Consequences for noisy matrix completion 

We now turn to some consequences of Theorem 1 for matrix completion in the noisy setting.
In particular, assume that we are given $n$ i.i.d. samples from the model (3), and let $\hat{\Theta}$ be
some estimate of the unknown matrix $\Theta^*$. Our strategy is to exploit the lower bound (9) in
application to the error matrix $\hat{\Theta} - \Theta^*$, and accordingly, we need to ensure that it has relatively
low rank and spikiness. Based on this intuition, it is natural to consider the estimator

$$\hat{\Theta} \in \arg\min_{\|\Theta\|_{\omega(\infty)} \le \frac{\alpha^*}{\sqrt{d_r d_c}}} \Big\{ \frac{1}{2n} \|y - \mathfrak{X}_n(\Theta)\|_2^2 + \lambda_n\, |||\Theta|||_{\omega(1)} \Big\}, \qquad (11)$$
where $\alpha^* \ge 1$ is a measure of spikiness, and the regularization parameter $\lambda_n > 0$ serves to
control the nuclear norm of the solution. In the special case when both $R$ and $C$ are identity
matrices (of the appropriate dimensions), this estimator is closely related to the standard
one considered in past work on the problem, with the only difference being the additional
$\ell_\infty$-norm constraint. In the more general weighted case, an M-estimator of the form (11) using
the weighted nuclear norm (but without the elementwise constraint) was recently suggested
by Salakhutdinov and Srebro [25], who provided empirical results to show superiority of the
weighted nuclear norm over the standard choice for the Netflix problem.
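
For readers who wish to experiment with an estimator of this type, the sketch below shows one proximal-gradient step for the quadratic loss plus a nuclear-norm penalty. It is a simplified illustration under stated assumptions, not the procedure analyzed here: it uses uniform weights ($R$ and $C$ equal to the identity), omits the elementwise $\ell_\infty$ constraint, and the names `svt` and `prox_grad_step` are hypothetical.

```python
import numpy as np

def svt(matrix, tau):
    """Singular value soft-thresholding: the prox of tau * nuclear norm."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def prox_grad_step(theta, rows, cols, signs, y, lam, step):
    """One proximal-gradient step on (1/2n)||y - X_n(theta)||^2 + lam*||theta||_nuc.

    Assumes uniform weights and ignores the l_infinity constraint.
    """
    d_r, d_c = theta.shape
    scale = np.sqrt(d_r * d_c)
    resid = y - scale * signs * theta[rows, cols]   # y_i - <<X^(i), theta>>
    grad = np.zeros_like(theta)
    # Gradient of the loss: -(1/n) * sum_i resid_i * X^(i); add.at handles repeats.
    np.add.at(grad, (rows, cols), -scale * signs * resid / len(y))
    return svt(theta - step * grad, step * lam)

# Tiny usage example with a single observation of entry (0, 0).
theta0 = np.zeros((2, 2))
out = prox_grad_step(theta0, np.array([0]), np.array([0]),
                     np.array([1.0]), np.array([2.0]), lam=0.0, step=1.0)
```

Iterating this step with a suitable step size gives a basic solver for the unconstrained problem; enforcing the elementwise constraint would require an additional projection or a splitting scheme, which is not shown.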

Past work on matrix completion has focused on the case of exactly low-rank matrices. 
Here we consider the more general setting of approximately low-rank matrices, including the 
exact setting as a particular case. We begin by stating a general upper bound that applies 
to any matrix $\Theta^*$, and involves a natural decomposition into estimation and approximation
error terms. 

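Theorem 2. Consider any solution $\hat{\Theta}$ to the weighted SDP (11) using regularization parameter

$$\lambda_n \ge 2\nu\, \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^{n} \xi_i\, R^{-1/2} X^{(i)} C^{-1/2} \Big|\Big|\Big|_{op}, \qquad (12)$$

and define $\lambda_n^* = \max\big\{\lambda_n,\ L\sqrt{\frac{d \log d}{n}}\big\}$. Then with probability greater than $1 - c_2 \exp(-c_2 \log d)$,
for each $r = 1, \ldots, d_r$, the error $\hat{\Delta} = \hat{\Theta} - \Theta^*$ satisfies

$$|||\hat{\Delta}|||_{\omega(F)}^2 \le c_1\, \alpha^*\, \lambda_n^* \Big[ \sqrt{r}\, |||\hat{\Delta}|||_{\omega(F)} + \sum_{j=r+1}^{d_r} \sigma_j(\sqrt{R}\, \Theta^* \sqrt{C}) \Big]. \qquad (13)$$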



Notice how the bound (13) shows a natural splitting into two terms. The first can be
interpreted as the estimation error associated with a rank $r$ matrix, whereas the second term
corresponds to approximation error, measuring how far $\sqrt{R}\, \Theta^* \sqrt{C}$ is from a rank $r$ matrix.
Of course, the bound holds for any choice of $r$, and in the corollaries to follow, we choose $r$
optimally so as to balance the estimation and approximation error terms.

In order to provide concrete rates using Theorem 2, it remains to address two issues. First,
we need to specify an explicit choice of $\lambda_n$ by bounding the operator norm of the matrix
$\frac{1}{n} \sum_{i=1}^{n} \xi_i\, R^{-1/2} X^{(i)} C^{-1/2}$, and secondly, we need to understand how to choose the parameter
$r$ so as to achieve the tightest possible bound. When $\Theta^*$ is exactly low-rank, then it is
obvious that we should choose $r = \mathrm{rank}(\Theta^*)$, so that the approximation error vanishes, viz.
$\sum_{j=r+1}^{d_r} \sigma_j(\sqrt{R}\, \Theta^* \sqrt{C}) = 0$. Doing so yields the following result:

Corollary 1 (Exactly low-rank matrices). Suppose that the noise sequence $\{\xi_i\}$ is i.i.d.,
zero-mean and sub-exponential, and $\Theta^*$ has rank at most $r$, Frobenius norm at most 1, and
spikiness at most $\alpha_{sp}(\Theta^*) \le \alpha^*$. If we solve the SDP (11) with $\lambda_n = 4 L \nu \sqrt{\frac{d \log d}{n}}$, then there
is a numerical constant $c_1'$ such that

$$|||\hat{\Theta} - \Theta^*|||_{\omega(F)}^2 \le c_1'\, (\nu^2 \vee 1)\, (\alpha^*)^2\, \frac{r\, d \log d}{n} \qquad (14)$$

with probability greater than $1 - c_2 \exp(-c_3 \log d)$.

Note that this rate has a natural interpretation: since a rank $r$ matrix of dimension $d_r \times d_c$ has
roughly $r(d_r + d_c)$ free parameters, we require a sample size of this order (up to logarithmic
factors) so as to obtain a controlled error bound. An interesting feature of the bound (14)
is the term $\nu^2 \vee 1 = \max\{\nu^2, 1\}$, which implies that we do not obtain exact recovery as
$\nu \to 0$. As we discuss at more length in Section 3.4, under the mild spikiness condition that
we have imposed, this behavior is unavoidable due to lack of identifiability within a certain
radius, as specified in the set $\mathfrak{C}$. For instance, consider the matrix $\Theta^*$ and the perturbed
version $\tilde{\Theta} = \Theta^* + \frac{1}{\sqrt{d_r d_c}}\, e_1 e_1^T$. With high probability, we have $\mathfrak{X}_n(\Theta^*) = \mathfrak{X}_n(\tilde{\Theta})$, so that
the observations, even if they were noiseless, fail to distinguish between these two models.
These types of examples, leading to non-identifiability, cannot be overcome without imposing
fairly restrictive matrix incoherence conditions, as we discuss at more length in Section 3.4.

As with past work [4, 12], Corollary 1 applies to the case of matrices that have exactly
rank $r$. In practical settings, it is more realistic to assume that the unknown matrix is not
exactly low-rank, but rather can be well approximated by a matrix with low rank. One way
in which to formalize this notion is via the $\ell_q$-"ball" of matrices

$$\mathbb{B}_q(\rho_q) := \Big\{ \Theta \in \mathbb{R}^{d_r \times d_c} \;\Big|\; \sum_{j=1}^{\min\{d_r, d_c\}} |\sigma_j(\sqrt{R}\, \Theta \sqrt{C})|^q \le \rho_q \Big\}. \qquad (15)$$

For $q = 0$, this set corresponds to the set of matrices with rank at most $r = \rho_0$, whereas for
values $q \in (0, 1]$, it consists of matrices whose (weighted) singular values decay at a relatively
fast rate. By applying Theorem 2 to this matrix family, we obtain the following corollary:

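Corollary 2 (Estimation of near low-rank matrices). Suppose that the noise is zero-mean
and sub-exponential, and $\Theta^* \in \mathbb{B}_q(\rho_q)$ and has spikiness at most $\alpha_{sp}(\Theta^*) \le \alpha^*$. With
the same choice of $\lambda_n$ as Corollary 1, there is a universal constant $c_1'$ such that

$$|||\hat{\Theta} - \Theta^*|||_{\omega(F)}^2 \le c_1'\, \rho_q \Big( (\nu^2 \vee 1)\, (\alpha^*)^2\, \frac{d \log d}{n} \Big)^{1 - \frac{q}{2}} \qquad (16)$$

with probability greater than $1 - c_2 \exp(-c_3 \log d)$.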

Note that this result is a strict generalization of Corollary 1, to which it reduces in the
case $q = 0$. (When $q = 0$, we have $\rho_0 = r$ so that the bound has the same form.) Note that
the price that we pay for approximately low rank is a smaller exponent, namely $1 - q/2$ as
opposed to 1 in the case $q = 0$. The proof of Corollary 2 is based on a more subtle application
of Theorem 2, one which chooses the effective rank $r$ in the bound (13) so as to trade off
between the estimation and approximation errors. In particular, the choice $r \asymp \rho_q \big( \frac{n}{d \log d} \big)^{q/2}$
turns out to yield the optimal trade-off, and hence the given error bound (16).

In order to illustrate the sharpness of our theory, let us compare the predictions of our two
corollaries to the empirical behavior of the M-estimator. In particular, we applied the nuclear
norm SDP to simulated data, using Gaussian observation noise with variance $\nu^2 = 0.25$ and
the uniform sampling model. In all cases, we solved the nuclear norm SDP using a non-smooth
optimization procedure due to Nesterov [20], via our own implementation in MATLAB. For
a given problem size $d$, we ran $T = 25$ trials and computed the squared Frobenius norm error
$|||\hat{\Theta} - \Theta^*|||_F^2$ averaged over the trials.

Figure 1 shows the results in the case of exactly low-rank matrices ($q = 0$), with the matrix
rank given by $r = \lceil \log^2(d) \rceil$. Panel (a) shows plots of the mean-squared Frobenius error versus
the raw sample size, for four different problem sizes with the number of matrix elements
$d^2 \in \{40^2, 60^2, 80^2, 100^2\}$. These plots show that the M-estimator is consistent, since each of
the curves decreases to zero as the sample size $n$ increases. Note that the curves shift to the
right as the matrix dimension $d$ increases, reflecting the natural intuition that larger matrices
require more samples. Based on the scaling predicted by Corollary 1, we expect that the
mean-squared Frobenius error should exhibit the scaling $|||\hat{\Theta} - \Theta^*|||_F^2 \asymp \frac{r\, d \log d}{n}$. Equivalently,
if we plot the MSE versus the rescaled sample size $N := \frac{n}{r\, d \log d}$, then all the curves should
be relatively well aligned, and decay at the rate $1/N$. Panel (b) of Figure 1 shows the same
simulation results re-plotted versus this rescaled sample size. Consistent with the prediction
of Corollary 1, all four plots are now relatively well-aligned. Figure 2 shows the same plots for
the case of approximately low-rank matrices ($q = 0.5$). Again, consistent with the prediction
of Corollary 2, we see qualitatively similar behavior in the plots of the MSE versus sample
size (panel (a)), and the rescaled sample size (panel (b)).

Figure 1. Plots of the mean-squared error in Frobenius norm for $q = 0$. Each curve corresponds
to a different problem size $d^2 \in \{40^2, 60^2, 80^2, 100^2\}$. (a) MSE versus the raw sample
size $n$. As expected, the curves shift to the right as $d$ increases, since more samples should
be required to achieve a given MSE for larger problems. (b) The same MSE plotted versus
the rescaled sample size $n/(r\, d \log d)$. Consistent with Corollary 1, all the plots are now fairly
well-aligned.
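
The rescaling used in panel (b) can be made concrete with a small calculation. The sketch below (helper names and numerical values are assumptions for illustration) converts between the raw sample size $n$ and the rescaled sample size $n/(r\, d \log d)$, showing that a fixed rescaled size corresponds to more raw samples as $d$ grows.

```python
import math

def rescaled_sample_size(n, r, d):
    """Rescaled sample size N = n / (r * d * log d)."""
    return n / (r * d * math.log(d))

def raw_sample_size(N, r, d):
    """Raw sample size n needed to reach a given rescaled size N."""
    return N * r * d * math.log(d)

# For a fixed rank and a fixed rescaled size, larger matrices need more samples.
n_small = raw_sample_size(N=2.0, r=10, d=40)
n_large = raw_sample_size(N=2.0, r=10, d=100)
```

If the predicted scaling holds, curves for different $d$ plotted against the rescaled size should roughly coincide, which is the alignment described for panel (b).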



3.3 Information-theoretic lower bounds 

The results of the previous section are achievable results, based on a particular polynomial-time
estimator. It is natural to ask how these bounds compare to the fundamental limits of
the problem, meaning the best performance achievable by any algorithm. As various authors
have noted [4, 12], a parameter counting argument indicates that roughly $n \approx r\,(d_r + d_c)$
samples are required to estimate a $d_r \times d_c$ matrix with rank $r$. This calculation can be made
more formal by metric entropy calculations for the Grassmann manifold (e.g., [29]); see also
Rohde and Tsybakov [24] for results on approximation numbers for the more general $\ell_q$-balls
of matrices. Such calculations, while accounting for the low-rank conditions, do not address
the additional "spikiness" constraints that are essential to the setting of matrix completion.
It is conceivable that these additional constraints could lead to a substantial volume reduction
in the allowable class of matrices, so that the scalings suggested by parameter counting or
metric entropy calculation for Grassmann manifolds would be overly conservative.

Figure 2. Plots of the mean-squared error in Frobenius norm for $q = 0.5$. Each curve
corresponds to a different problem size $d^2 \in \{40^2, 60^2, 80^2, 100^2\}$. (a) MSE versus the raw
sample size $n$. As expected, the curves shift to the right as $d$ increases, since more samples
should be required to achieve a given MSE for larger problems. (b) The same MSE plotted
versus the rescaled sample size $n/(\rho_q^{1/(1-q/2)}\, d \log d)$. Consistent with Corollary 2, all the plots are
now fairly well-aligned.

Accordingly, in this section, we provide a direct and constructive argument to lower bound
the minimax rates of Frobenius norm over classes of matrices that are near low-rank and not
overly spiky. This argument establishes that the bounds established in Corollaries 1 and 2
are sharp up to logarithmic factors, meaning that no estimator performs substantially better
than the one considered here. More precisely, consider the matrix classes

$$\tilde{\mathbb{B}}(\rho_q) = \Big\{ \Theta \in \mathbb{R}^{d \times d} \;\Big|\; \sum_{j=1}^{d} \sigma_j(\Theta)^q \le \rho_q,\ \alpha_{sp}(\Theta) \le \sqrt{32 \log d} \Big\}, \qquad (17)$$

corresponding to square $d \times d$ matrices that are near low-rank (belonging to the $\ell_q$-balls
previously defined (15)), and have a logarithmic spikiness ratio. The following result applies
to the minimax risk in Frobenius norm, namely the quantity

$$\mathfrak{M}_n(\tilde{\mathbb{B}}(\rho_q)) := \inf_{\tilde{\Theta}}\ \sup_{\Theta^* \in \tilde{\mathbb{B}}(\rho_q)} E\big[ |||\tilde{\Theta} - \Theta^*|||_F^2 \big], \qquad (18)$$

where the infimum is taken over all estimators $\tilde{\Theta}$ that are measurable functions of $n$ samples.

Theorem 3. There is a universal numerical constant $c_5 > 0$ such that

$$\mathfrak{M}_n(\tilde{\mathbb{B}}(\rho_q)) \ge c_5 \min\Big\{ \rho_q \Big( \frac{\nu^2 d}{n} \Big)^{1 - \frac{q}{2}},\ \frac{\nu^2 d^2}{n} \Big\}. \qquad (19)$$

The term of primary interest in this bound is the first one, namely $\rho_q \big( \frac{\nu^2 d}{n} \big)^{1 - \frac{q}{2}}$. It is the
dominant term in the bound whenever the $\ell_q$-radius satisfies the bound

$$\rho_q \le \Big( \frac{\nu^2 d}{n} \Big)^{\frac{q}{2}} d. \qquad (20)$$

In the special case $q = 0$, corresponding to the exactly low-rank case, the bound (20) always
holds, since it reduces to requiring that the rank $r = \rho_0$ is less than or equal to $d$. In these
regimes, Theorem 3 establishes that the upper bounds obtained in Corollaries 1 and 2 are
minimax-optimal up to factors logarithmic in matrix dimension $d$.
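
As a quick numerical illustration of the two terms in the lower bound, the sketch below evaluates $\rho_q (\nu^2 d/n)^{1 - q/2}$ and $\nu^2 d^2 / n$, and checks the radius condition under which the first term is the smaller one. The universal constant is omitted, and the function names and parameter values are assumptions made for the example.

```python
def lower_bound_terms(rho_q, q, nu, d, n):
    """Return the two terms inside the minimum (universal constant omitted)."""
    first = rho_q * (nu**2 * d / n) ** (1 - q / 2)
    second = nu**2 * d**2 / n
    return first, second

def radius_condition_holds(rho_q, q, nu, d, n):
    """Check rho_q <= (nu^2 d / n)^(q/2) * d, i.e. the first term is the minimum."""
    return rho_q <= (nu**2 * d / n) ** (q / 2) * d

# Exactly low-rank example: q = 0, so rho_0 is the rank.
first, second = lower_bound_terms(rho_q=5, q=0, nu=0.5, d=100, n=5000)
holds = radius_condition_holds(rho_q=5, q=0, nu=0.5, d=100, n=5000)
```

For $q = 0$ the condition reduces to the rank being at most $d$, so the first term determines the bound in this example.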

3.4 Comparison to other work 

We now turn to a detailed comparison of our bounds to those obtained in past work on
noisy matrix completion, in particular the papers by Candes and Plan [4] (hereafter CP)
and Keshavan et al. [12] (hereafter KMO). Both papers considered only the case of exactly
low-rank matrices, corresponding to the special case of $q = 0$ in our notation. Since neither
paper provided results for the general case of near low-rank matrices, nor the general result
(with estimation and approximation errors) stated in Theorem 2, our discussion is limited
to comparing Corollary 1 to their results. So as to simplify discussion, we restate all results
under the scalings used in this paper$^1$ (i.e., with $|||\Theta^*|||_F = 1$).

3.4.1 Comparison of rates 

Under the strong incoherence conditions required for exact matrix recovery (see below for
discussion), Theorem 7 in CP gives a bound on $|||\hat{\Theta} - \Theta^*|||_F$ that depends on the Frobenius
norm of the error matrix $\Xi \in \mathbb{R}^{d_1 \times d_2}$, as defined by the noise variables (with $[\Xi]_{j(i) k(i)}$ equal to
the noise in the $i$-th observation in our case). Under the observation model (1) and the scalings of our paper, as long as $n > d$,
where $d = d_1 + d_2$ (a condition certainly required for Frobenius norm consistency), we have
$|||\Xi|||_F = \Theta(\nu \sqrt{n}/d)$ with high probability. Given this scaling, the CP upper bound takes the
form

$$|||\hat{\Theta} - \Theta^*|||_F \lesssim \nu \Big\{ \sqrt{d} + \frac{\sqrt{n}}{d} \Big\}. \qquad (21)$$

Note that if the noise standard deviation $\nu$ tends to zero while the sample size $n$, matrix size
$d$ and rank $r$ all remain fixed, then this bound guarantees that the Frobenius error tends to
zero. This behavior as $\nu \to 0$ is intuitively reasonable, given that their proof technique is an
extrapolation from the case of exact recovery for noiseless observations ($\nu = 0$). However,
note that for any fixed noise deviation $\nu > 0$, the first term increases to infinity as the
matrix dimension $d$ increases, whereas the second term actually grows as the sample size $n$
increases. Consequently, the CP results do not guarantee statistical consistency, unlike the
bounds proved here.

Keshavan et al. [12] analyzed alternative methods based on trimming and applying the 
SVD. For Gaussian noise, their methods guarantee bounds (with high probability) of the form 




where $d_2/d_1$ is the aspect ratio of $\Theta^*$, and $\kappa(\Theta^*) = \sigma_{\max}(\Theta^*)/\sigma_{\min}(\Theta^*)$ is the condition number of
$\Theta^*$. This result is more directly comparable to our Corollary 1; apart from the additional
factor involving either the aspect ratio or the condition number, it is sharper since it does
not involve the factor $\log d$ present in our bound. For a fixed noise standard deviation $\nu$,

$^1$ The papers CP and KMO use two different sets of scalings, one with $|||\Theta^*|||_F = \Theta(d)$ and the other with
a different normalization of $|||\Theta^*|||_F$, so that some care is required in converting between results.
the bound (22) guarantees statistical consistency as long as its right-hand side tends to zero. The most
significant differences are the presence of the aspect ratio $d_2/d_1$ or the condition number
$\kappa(\Theta^*)$ in the upper bound (22). The aspect ratio is a quantity that can be as small as one,
or as large as $d_2$, so that the pre-factor in the bound (22) can scale in a dimension-dependent
way. Similarly, for any matrix with rank larger than one, the condition number can be made
arbitrarily large. For instance, in the rank two case, define a matrix with $\sigma_{\max}(\Theta^*) = \sqrt{1 - \delta^2}$
and $\sigma_{\min}(\Theta^*) = \delta$, and consider the behavior as $\delta \to 0$. In contrast, our bounds are invariant
to both the aspect ratio and the condition number of $\Theta^*$.
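The difference in scaling can be made concrete with a small numerical sketch. The code below evaluates the CP-style bound (21) and the Frobenius-error rate suggested by Corollary 1 for a fixed dimension and growing sample size. All constants (and the spikiness factor) are dropped, and the function names and the particular values of $d$, $r$ and $n$ are illustrative assumptions rather than quantities from the papers under discussion.

```python
import math

def cp_bound(nu, d, n):
    """CP-style bound (21) with constants dropped: nu * (sqrt(d) + sqrt(n) / d)."""
    return nu * (math.sqrt(d) + math.sqrt(n) / d)

def corollary1_rate(nu, d, r, n):
    """Rate suggested by Corollary 1 (constants and spikiness factor dropped)."""
    return nu * math.sqrt(r * d * math.log(d) / n)

# Illustrative values only (assumed): fixed dimension and rank, growing sample size.
nu, d, r = 1.0, 100, 2
for n in (10**4, 10**5, 10**6, 10**7):
    print(n, round(cp_bound(nu, d, n), 3), round(corollary1_rate(nu, d, r, n), 4))
```

In this sketch the first column of values never drops below $\nu\sqrt{d}$ and eventually grows with $n$, while the second decays like $1/\sqrt{n}$, which is the qualitative contrast described above.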



3.4.2 Comparison of matrix conditions 

We now turn to a comparison of the various matrix incoherence assumptions invoked in the
analysis of CP and KMO, and comparison to our spikiness condition. As before, for clarity, we
specialize our discussion to the square case ($d_r = d_c = d$), since the rectangular case is not
essentially different. The matrix incoherence conditions are stated in terms of the singular value
decomposition $\Theta^* = U \Sigma V^T$ of the target matrix. Here $U \in \mathbb{R}^{d \times r}$ and $V \in \mathbb{R}^{d \times r}$ are
matrices of the left and right singular vectors respectively, satisfying $U^T U = V^T V = I_{r \times r}$, whereas
$\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix of the singular values. The purpose of matrix incoherence is to
enforce that the left and right singular vectors should not be aligned with the standard basis.
Among other assumptions, the CP analysis imposes the incoherence conditions

$$\Big\|UU^T - \frac{r}{d} I_{d\times d}\Big\|_\infty \le \mu\frac{\sqrt{r}}{d}, \quad \Big\|VV^T - \frac{r}{d} I_{d\times d}\Big\|_\infty \le \mu\frac{\sqrt{r}}{d}, \quad \text{and} \quad \|UV^T\|_\infty \le \mu\frac{\sqrt{r}}{d}, \tag{23}$$

for some constant $\mu > 0$. Parts of the KMO analysis impose the related incoherence condition

$$\max_{j} |UU^T|_{jj} \le \mu_0\frac{r}{d} \quad \text{and} \quad \max_{j} |VV^T|_{jj} \le \mu_0\frac{r}{d}. \tag{24}$$



Both of these conditions ensure that the singular vectors are sufficiently "spread-out" , so as 
not to be aligned with the standard basis. 

A remarkable property of conditions (23) and (24) is that they exhibit no dependence on
the singular values of $\Theta^*$. If one is interested only in exact recovery in the noiseless setting,
then this lack of dependence is reasonable. However, if approximate recovery is the goal, as
is necessarily the case in the more realistic setting of noisy observations, then it is clear that
a minimal set of sufficient conditions should also involve the singular values, as is the case for
our spikiness measure $\alpha_{sp}(\Theta^*)$. The following example gives a concrete demonstration of an
instance where our conditions are satisfied, so that approximate recovery is possible, whereas
the incoherence conditions are violated.

Example. Let $\Gamma \in \mathbb{R}^{d \times d}$ be a positive semidefinite symmetric matrix with rank $r - 1$,
Frobenius norm $|||\Gamma|||_F = 1$ and $\|\Gamma\|_\infty \le c_0/d$. For a scalar parameter $t > 0$, consider the
matrix

$$\Theta^* := \Gamma + t\, e_1 e_1^T, \tag{25}$$

where $e_1 \in \mathbb{R}^d$ is the canonical basis vector with one in its first entry, and zero elsewhere. By
construction, the matrix $\Theta^*$ has rank at most $r$. Moreover, as long as $t = O(1/d)$, we are
guaranteed that our spikiness measure satisfies the bound $\alpha_{sp}(\Theta^*) = O(1)$. Indeed, we have
$|||\Theta^*|||_F \ge |||\Gamma|||_F - t = 1 - t$, and hence

$$\alpha_{sp}(\Theta^*) = \frac{d\,\|\Theta^*\|_\infty}{|||\Theta^*|||_F} \le \frac{d\,(\|\Gamma\|_\infty + t)}{1 - t} \le \frac{c_0 + t d}{1 - t} = O(1).$$
Consequently, for any choice of $\Gamma$ as specified above, Corollary 1 implies that the SDP will
recover the matrix $\Theta^*$ up to a tolerance $O\big(\sqrt{r d \log d / n}\big)$. This captures the natural intuition that
"poisoning" the matrix $\Gamma$ with the term $t\, e_1 e_1^T$ should have essentially no effect, as long as $t$
is not too large.

On the other hand, suppose that we choose the matrix $\Gamma$ such that its $r - 1$ eigenvectors
are orthogonal to $e_1$. In this case, we have $\Theta^* e_1 = t e_1$, so that $e_1$ is also an eigenvector of $\Theta^*$.
Letting $U \in \mathbb{R}^{d \times r}$ be the matrix of eigenvectors, we have $e_1^T U U^T e_1 = 1$. Consequently, for
any fixed $\mu$ (or $\mu_0$) and rank $r \ll d$, conditions (23) and (24) are violated.
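The example can be checked numerically on a small instance. The sketch below is only an illustration under assumed choices: the dimension $d = 65$, the rank $r = 4$, the value $t = 1/d$, and the use of Sylvester-Hadamard columns to build $\Gamma$ are all hypothetical and not taken from the text. It builds a matrix of the form (25), then evaluates the spikiness ratio and the diagonal entry $e_1^T U U^T e_1$ that enters the incoherence conditions.

```python
import numpy as np

def sylvester_hadamard(m):
    """Return a 2^m x 2^m matrix with +/-1 entries and mutually orthogonal columns."""
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

d, r = 65, 4                                  # illustrative sizes (assumed)
H = sylvester_hadamard(6)                     # 64 x 64
V = np.zeros((d, r - 1))
V[1:, :] = H[:, :r - 1] / np.sqrt(d - 1)      # orthonormal columns, all orthogonal to e_1

Gamma = (V @ V.T) / np.sqrt(r - 1)            # PSD, rank r - 1, Frobenius norm 1
t = 1.0 / d
Theta = Gamma.copy()
Theta[0, 0] += t                              # Theta* = Gamma + t * e_1 e_1^T

# Spikiness ratio in the square case: d * max-entry / Frobenius norm.
alpha_sp = d * np.abs(Theta).max() / np.linalg.norm(Theta, "fro")

# Eigenvectors belonging to the r largest eigenvalues of the symmetric matrix Theta*.
eigvals, eigvecs = np.linalg.eigh(Theta)
U = eigvecs[:, -r:]
coherence_e1 = float(U[0, :] @ U[0, :])       # equals e_1^T U U^T e_1

print(alpha_sp, coherence_e1, r / d)
```

In this instance the spikiness ratio stays bounded by a small constant, whereas the diagonal entry of $UU^T$ is essentially one, far above the level $r/d$ that condition (24) would require for a fixed $\mu_0$.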





4 Proofs for noisy matrix completion 

We now turn to the proofs of our results. This section is devoted to the results that apply
directly to noisy matrix completion, in particular the achievable result given in Theorem 2, its
associated Corollaries 1 and 2, and the information-theoretic lower bound given in Theorem 3.
The proof of Theorem 1 is provided in Section 5 to follow.



4.1 A useful transformation 

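We begin by describing a transformation that is useful both in these proofs, and the later
proof of Theorem 1. In particular, we consider the mapping $\Theta \mapsto \Gamma := \sqrt{R}\,\Theta\sqrt{C}$, as well as
the modified observation operator $\mathfrak{X}_n' : \mathbb{R}^{d \times d} \to \mathbb{R}^n$ with elements

$$[\mathfrak{X}_n'(\Gamma)]_i = \langle\langle \widetilde{X}^{(i)}, \Gamma \rangle\rangle, \quad \text{for } i = 1, 2, \ldots, n,$$

where $\widetilde{X}^{(i)} := R^{-1/2} X^{(i)} C^{-1/2}$. Note that $\mathfrak{X}_n'(\Gamma) = \mathfrak{X}_n(\Theta)$ by construction, and moreover

$$|||\Gamma|||_F = |||\Theta|||_{\omega(F)}, \quad |||\Gamma|||_1 = |||\Theta|||_{\omega(1)}, \quad \text{and} \quad \|\Gamma\|_\infty = \|\Theta\|_{\omega(\infty)},$$

which implies that

$$\beta_{ra}(\Theta) = \underbrace{\frac{|||\Gamma|||_1}{|||\Gamma|||_F}}_{\beta'_{ra}(\Gamma)} \quad \text{and} \quad \alpha_{sp}(\Theta) = \underbrace{\frac{\sqrt{d_r d_c}\,\|\Gamma\|_\infty}{|||\Gamma|||_F}}_{\alpha'_{sp}(\Gamma)}. \tag{26}$$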
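The identity $\mathfrak{X}_n'(\Gamma) = \mathfrak{X}_n(\Theta)$ is a one-line calculation, and it can also be confirmed numerically. The following sketch assumes, purely for illustration, that $R$ and $C$ are diagonal matrices with positive entries and that a single observation matrix is a rescaled indicator of one entry; the dimensions and random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_r, d_c = 5, 7

# Assumed for illustration: R and C are diagonal with positive entries.
R = np.diag(rng.uniform(0.5, 2.0, size=d_r))
C = np.diag(rng.uniform(0.5, 2.0, size=d_c))
sqrt_R, sqrt_C = np.sqrt(R), np.sqrt(C)       # entrywise sqrt is valid for diagonal matrices
inv_sqrt_R = np.diag(1.0 / np.sqrt(np.diag(R)))
inv_sqrt_C = np.diag(1.0 / np.sqrt(np.diag(C)))

Theta = rng.standard_normal((d_r, d_c))
X = np.zeros((d_r, d_c))
X[2, 4] = np.sqrt(d_r * d_c)                  # one rescaled entry indicator

Gamma = sqrt_R @ Theta @ sqrt_C               # transformed matrix
X_tilde = inv_sqrt_R @ X @ inv_sqrt_C         # transformed observation matrix

lhs = np.sum(X_tilde * Gamma)                 # trace inner product <<X_tilde, Gamma>>
rhs = np.sum(X * Theta)                       # trace inner product <<X, Theta>>
print(lhs, rhs)
```

The two inner products agree because the factor $\sqrt{R_j C_k}$ introduced in $\Gamma$ is cancelled entry by entry by the inverse factor in $\widetilde{X}^{(i)}$.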

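Based on this change of variables, let us define a modified version of the original constraint set as
follows:

$$\mathfrak{C}'(n; c_0) = \Big\{ 0 \neq \Gamma \in \mathbb{R}^{d \times d} \;\Big|\; \alpha'_{sp}(\Gamma)\, \beta'_{ra}(\Gamma) \le \frac{1}{c_0}\sqrt{\frac{n}{d \log d}} \Big\}. \tag{27}$$

In this new notation, the lower bound (9) from Theorem 1 can be re-stated as

$$\frac{\|\mathfrak{X}_n'(\Gamma)\|_2}{\sqrt{n}} \ge \frac{1}{8}\,|||\Gamma|||_F \Big\{ 1 - \frac{128\, \alpha'_{sp}(\Gamma)}{\sqrt{n}} \Big\} \quad \text{for all } \Gamma \in \mathfrak{C}'(n; c_0). \tag{28}$$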



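4.2 Proof of Theorem 2

We now turn to the proof of Theorem 2. Defining the estimate $\widehat{\Gamma} := \sqrt{R}\,\widehat{\Theta}\sqrt{C}$, we have

$$\widehat{\Gamma} \in \arg\min_{\|\Gamma\|_\infty \le \alpha^*/\sqrt{d_r d_c}} \Big\{ \frac{1}{2n}\|y - \mathfrak{X}_n'(\Gamma)\|_2^2 + \lambda_n |||\Gamma|||_1 \Big\}, \tag{29}$$

and our goal is to upper bound the ordinary Frobenius norm $|||\widehat{\Gamma} - \Gamma^*|||_F$.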

We now state a useful technical result. Parts (a) and (b) of the following lemma were
proven by Recht et al. [23] and Negahban and Wainwright [19], respectively.

Lemma 1. Let $(U, V)$ represent a pair of $r$-dimensional subspaces of left and right singular
vectors of $\Gamma^*$. Then there exists a matrix decomposition $\Delta = \Delta' + \Delta''$ of the error $\Delta$ such
that

(a) The matrix $\Delta'$ satisfies the constraint $\operatorname{rank}(\Delta') \le 2r$, and

(b) Given the choice (12), the nuclear norm of $\Delta''$ is bounded as

$$|||\Delta''|||_1 \le 3\,|||\Delta'|||_1 + 4\sum_{j=r+1}^{d_r} \sigma_j(\Gamma^*). \tag{30}$$

Note that the bound (30), combined with triangle inequality, implies that

$$|||\Delta|||_1 \le |||\Delta'|||_1 + |||\Delta''|||_1 \le 4\,|||\Delta'|||_1 + 4\sum_{j=r+1}^{d_r} \sigma_j(\Gamma^*) \le 8\sqrt{r}\,|||\Delta|||_F + 4\sum_{j=r+1}^{d_r} \sigma_j(\Gamma^*), \tag{31}$$

where the last inequality uses the fact that $\operatorname{rank}(\Delta') \le 2r$.
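The rank constraint enters through the elementary inequality $|||A|||_1 \le \sqrt{\operatorname{rank}(A)}\,|||A|||_F$, which follows from the Cauchy-Schwarz inequality applied to the singular values. A minimal numerical sketch is given below; the random matrix stands in for $\Delta'$, and its size and rank are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 30, 3

# A random matrix of rank at most 2r, used as a stand-in for Delta' (assumed example).
A = rng.standard_normal((d, 2 * r)) @ rng.standard_normal((2 * r, d))

sing = np.linalg.svd(A, compute_uv=False)
nuclear = sing.sum()                          # nuclear norm: sum of singular values
frobenius = np.sqrt((sing ** 2).sum())        # Frobenius norm
bound = np.sqrt(2 * r) * frobenius            # sqrt(rank) * Frobenius norm

print(nuclear, bound)
```

Since $4\sqrt{2r} \le 8\sqrt{r}$, a bound of this type on $|||\Delta'|||_1$ is enough to produce the constant appearing in (31).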

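We now split into two cases, depending on whether or not the error $\Delta$ belongs to the set
$\mathfrak{C}'(n; c_0)$.

Case 1: First suppose that $\Delta \notin \mathfrak{C}'(n; c_0)$. In this case, by the definition (27), we have

$$|||\Delta|||_F^2 \le c_0 \big(\sqrt{d_r d_c}\,\|\Delta\|_\infty\big)\, |||\Delta|||_1 \sqrt{\frac{d \log d}{n}} \le 2 c_0\, \alpha^*\, |||\Delta|||_1 \sqrt{\frac{d \log d}{n}},$$

since $\|\Delta\|_\infty \le \|\Gamma^*\|_\infty + \|\widehat{\Gamma}\|_\infty \le \frac{2\alpha^*}{\sqrt{d_r d_c}}$. Now applying the bound (31), we obtain

$$|||\Delta|||_F^2 \le 2 c_0\, \alpha^* \sqrt{\frac{d \log d}{n}} \Big\{ 8\sqrt{r}\,|||\Delta|||_F + 4\sum_{j=r+1}^{d_r} \sigma_j(\Gamma^*) \Big\}. \tag{32}$$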

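Case 2: Otherwise, we must have $\Delta \in \mathfrak{C}'(n; c_0)$. Recall the reformulated lower bound (28).
On one hand, if $128\,\alpha'_{sp}(\Delta)/\sqrt{n} > 1/2$, then we have

$$|||\Delta|||_F \le \frac{256\sqrt{d_r d_c}\,\|\Delta\|_\infty}{\sqrt{n}} \le \frac{512\,\alpha^*}{\sqrt{n}}. \tag{33}$$

On the other hand, if $128\,\alpha'_{sp}(\Delta)/\sqrt{n} \le 1/2$, then from the bound (28) we have

$$\frac{\|\mathfrak{X}_n'(\Delta)\|_2}{\sqrt{n}} \ge \frac{|||\Delta|||_F}{16} \tag{34}$$

with high probability. Note that $\widehat{\Gamma}$ is optimal and $\Gamma^*$ is feasible for the convex program (29),
so that we have the basic inequality

$$\frac{1}{2n}\|y - \mathfrak{X}_n'(\widehat{\Gamma})\|_2^2 + \lambda_n |||\widehat{\Gamma}|||_1 \le \frac{1}{2n}\|y - \mathfrak{X}_n'(\Gamma^*)\|_2^2 + \lambda_n |||\Gamma^*|||_1.$$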

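Some algebra yields

$$\frac{1}{2n}\|\mathfrak{X}_n'(\Delta)\|_2^2 \le \nu\, \Big\langle\Big\langle \Delta,\; \frac{1}{n}\sum_{i=1}^n \xi_i \widetilde{X}^{(i)} \Big\rangle\Big\rangle + \lambda_n |||\widehat{\Gamma} + \Delta|||_1 - \lambda_n |||\widehat{\Gamma}|||_1.$$

Substituting the lower bound (34) into this inequality yields

$$\frac{|||\Delta|||_F^2}{512} \le \nu\, \Big\langle\Big\langle \Delta,\; \frac{1}{n}\sum_{i=1}^n \xi_i \widetilde{X}^{(i)} \Big\rangle\Big\rangle + \lambda_n |||\widehat{\Gamma} + \Delta|||_1 - \lambda_n |||\widehat{\Gamma}|||_1.$$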

From this point onwards, the proof is identical (apart from constants) to Theorem 1 in
Negahban and Wainwright [19], and we obtain that there is a numerical constant $c_1$ such that

$$|||\Delta|||_F^2 \le c_1\, \alpha^* \lambda_n \Big\{ \sqrt{r}\,|||\Delta|||_F + \sum_{j=r+1}^{d_r} \sigma_j(\Gamma^*) \Big\}. \tag{35}$$

Putting together the pieces: Summarizing our results, we have shown that with high
probability, one of the three bounds (32), (33) or (35) must hold. Since $\alpha^* \ge 1$, we can
summarize by claiming that there is a universal constant $c_1$ such that

$$|||\Delta|||_F^2 \le c_1\, \alpha^* \max\Big\{ \lambda_n, \sqrt{\frac{d \log d}{n}} \Big\} \Big[ \sqrt{r}\,|||\Delta|||_F + \sum_{j=r+1}^{d_r} \sigma_j(\Gamma^*) \Big].$$

Translating this result back to the original co-ordinate system ($\Gamma^* = \sqrt{R}\,\Theta^*\sqrt{C}$) yields the
claim (13).



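4.3 Proof of Corollary 1

When $\Theta^*$ (and hence $\sqrt{R}\,\Theta^*\sqrt{C}$) has rank $r < d_r$, then we have $\sum_{j=r+1}^{d_r} \sigma_j(\sqrt{R}\,\Theta^*\sqrt{C}) = 0$.
Consequently, the bound (13) reduces to $|||\Delta|||_{\omega(F)} \le c_1\, \alpha^* \lambda_n^* \sqrt{r}$. To complete the proof, it
suffices to show that

$$\mathbb{P}\Big[ \Big|\Big|\Big| \frac{1}{n}\sum_{i=1}^n \xi_i R^{-1/2} X^{(i)} C^{-1/2} \Big|\Big|\Big|_2 \ge c_1\, \nu \sqrt{\frac{d \log d}{n}} \Big] \le c_2 \exp(-c_2\, d \log d).$$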

We do so via the Ahlswede-Winter matrix bound, as stated in Appendix F. Defining the
random matrix $Y^{(i)} := \xi_i R^{-1/2} X^{(i)} C^{-1/2}$, we first note that $\xi_i$ is sub-exponential with
parameter 1, and $|R^{-1/2} X^{(i)} C^{-1/2}|$ has a single entry with magnitude at most $L\sqrt{d_r d_c}$, which
implies that

$$\|Y^{(i)}\|_{\psi_1} \le L \nu \sqrt{d_r d_c} \le 2 \nu L d.$$

(Here $\|\cdot\|_{\psi_1}$ denotes the Orlicz norm [15] of a random variable, as defined by the function
$\psi_1(x) = \exp(x) - 1$; see Appendix F.) Moreover, we have

E[(y«) T yW] =^ 2 E r drdc - " ~ T 



v 2 E 



dip 

7? ^ e Hi) e k(i) 

Rj(i) <^k(i) 



1/2 drIdcXd c 



so that $|||\mathbb{E}[(Y^{(i)})^T Y^{(i)}]|||_2 \le 2\nu^2 d$, recalling that $2d = d_r + d_c \ge d_r$. The same bound applies
to $|||\mathbb{E}[Y^{(i)} (Y^{(i)})^T]|||_2$, so that applying Lemma 7 with $t = n\delta$, we conclude that



1 n ft 

P[|||-^^ii- 1/2 xWc , - 1/2 ||| 2 > 5] < (d r x d c ) max{exp(-n5 2 /(16^ 2 (i), ex P("^^)} 

i=l 

Since $\sqrt{d_r d_c} \le d_r + d_c = 2d$, if we set $\delta^2 = c_1^2\, \nu^2\, \frac{d \log d}{n}$ for a sufficiently large constant $c_1$, the
result follows. (Here we also use the assumption that $n = \Omega(d \log d)$, so that the term $\sqrt{\frac{d \log d}{n}}$
is dominant.)



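4.4 Proof of Corollary 2

For this corollary, we need to determine an appropriate choice of $r$ so as to optimize the
bound (13). To ease notation, let us make use of the shorthand notation $\Gamma^* = \sqrt{R}\,\Theta^*\sqrt{C}$.
With the singular values of $\Gamma^*$ ordered in non-increasing order, fix some threshold $\tau > 0$ to
be determined, and set $r = \max\{ j \mid \sigma_j(\Gamma^*) > \tau \}$. This choice ensures that

$$\sum_{j=r+1}^{d_r} \sigma_j(\Gamma^*) = \tau \sum_{j=r+1}^{d_r} \frac{\sigma_j(\Gamma^*)}{\tau} \le \tau \sum_{j=r+1}^{d_r} \Big( \frac{\sigma_j(\Gamma^*)}{\tau} \Big)^q \le \tau^{1-q} \rho_q.$$

Moreover, we have $r \tau^q \le \sum_{j=1}^{r} \{\sigma_j(\Gamma^*)\}^q \le \rho_q$, which implies that $\sqrt{r} \le \sqrt{\rho_q}\, \tau^{-q/2}$.
Substituting these relations into the upper bound (13) leads to

$$|||\Delta|||_{\omega(F)}^2 \le c_1\, \alpha^* \lambda_n^* \Big[ \sqrt{\rho_q}\, \tau^{-q/2}\, |||\Delta|||_{\omega(F)} + \tau^{1-q} \rho_q \Big].$$

In order to obtain the sharpest possible upper bound, we set $\tau = \alpha^* \lambda_n^*$. Following some
algebra, we find that there is a universal constant $c_1$ such that

$$|||\Delta|||_{\omega(F)}^2 \le c_1\, \rho_q \big( (\alpha^*)^2 (\lambda_n^*)^2 \big)^{1 - q/2}.$$

As in the proof of Corollary 1, it suffices to choose $\lambda_n = c\,\nu\sqrt{\frac{d \log d}{n}}$ for a suitable constant $c$,
so that $\lambda_n^* = O\Big(\sqrt{(\nu^2 + 1)\frac{d \log d}{n}}\Big)$, from which the claim follows.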
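The two consequences of the thresholding step can be verified on a synthetic spectrum. The sketch below is an illustration only: the polynomially decaying singular values, the exponent $q = 0.5$ and the threshold $\tau = 0.01$ are assumed values, and $\rho_q$ is taken to be the smallest radius for which the $\ell_q$ constraint holds.

```python
import numpy as np

def threshold_rank(sigma, tau):
    """Number of singular values strictly above tau (sigma sorted non-increasing)."""
    return int(np.sum(sigma > tau))

q = 0.5
sigma = 1.0 / np.arange(1, 201) ** 3.0        # assumed decay, for illustration only
rho_q = np.sum(sigma ** q)                    # smallest radius satisfying the l_q constraint
tau = 0.01                                    # assumed threshold

r = threshold_rank(sigma, tau)
tail = np.sum(sigma[r:])                      # sum of singular values at or below tau

print(r, tail, tau ** (1 - q) * rho_q)
```

Both inequalities used in the proof, the tail bound $\sum_{j > r} \sigma_j \le \tau^{1-q}\rho_q$ and the rank bound $\sqrt{r} \le \sqrt{\rho_q}\,\tau^{-q/2}$, hold for any threshold, which is what allows $\tau$ to be tuned afterwards.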

4.5 Proof of Theorem [3] 

Our proof of this lower bound is based on a combination of information-theoretic methods [33], [32],
which allow us to reduce to a multiway hypothesis test, and an application of the probabilistic
method so as to construct a suitably large packing set. By Markov's inequality, it suffices to
prove that


sup 

e*el(p„) 



19 - e*tl > 25 



1 

> -. 

~ 2 



In order to do so, we proceed in a standard way, namely by reducing the estimation problem
to a testing problem over a suitably constructed packing set contained within $\mathbb{B}(\rho_q)$. In
particular, consider a set $\{\Theta^1, \ldots, \Theta^{M(\delta)}\}$ of matrices, contained within $\mathbb{B}(\rho_q)$, such that
$|||\Theta^k - \Theta^\ell|||_F \ge \delta$ for all $\ell \neq k$. To ease notation, we use $M$ as shorthand for $M(\delta)$ through
much of the argument. Suppose that we choose an index $V \in \{1, 2, \ldots, M\}$ uniformly at
random (u.a.r.), and we are given observations $y \in \mathbb{R}^n$ from the observation model (3) with
$\Theta^* = \Theta^V$. Then triangle inequality yields the lower bound



sup P 

i) 



e*ei 



ie -e* lb- > 5 



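If we condition on $\mathfrak{X}_n$, a variant of Fano's inequality yields

$$\mathbb{P}[\widehat{V} \neq V \mid \mathfrak{X}_n] \ge 1 - \frac{\binom{M}{2}^{-1} \sum_{\ell \neq k} D(\Theta^k \,\|\, \Theta^\ell) + \log 2}{\log M}, \tag{36}$$

where $D(\Theta^k \,\|\, \Theta^\ell)$ denotes the Kullback-Leibler divergence between the distributions of
$(y \mid \mathfrak{X}_n, \Theta^k)$ and $(y \mid \mathfrak{X}_n, \Theta^\ell)$. In particular, for additive Gaussian noise with variance $\nu^2$, we
have

$$D(\Theta^k \,\|\, \Theta^\ell) = \frac{1}{2\nu^2} \|\mathfrak{X}_n(\Theta^k) - \mathfrak{X}_n(\Theta^\ell)\|_2^2,$$

and moreover,

$$\mathbb{E}_{\mathfrak{X}_n}\big[ D(\Theta^k \,\|\, \Theta^\ell) \big] = \frac{n}{2\nu^2}\, |||\Theta^k - \Theta^\ell|||_F^2.$$

Combined with the bound (36), we obtain the bound

$$\mathbb{P}[\widehat{V} \neq V] = \mathbb{E}_{\mathfrak{X}_n}\big\{ \mathbb{P}[\widehat{V} \neq V \mid \mathfrak{X}_n] \big\} \ge 1 - \frac{\binom{M}{2}^{-1} \sum_{\ell \neq k} \frac{n}{2\nu^2}\, |||\Theta^k - \Theta^\ell|||_F^2 + \log 2}{\log M}. \tag{37}$$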

The remainder of the proof hinges on the following technical lemma, which we prove in
Appendix A.

Lemma 2. Let $d \ge 10$ be a positive integer, and let $\delta > 0$. Then for each $r = 1, 2, \ldots, d$,
there exists a set of $d$-dimensional matrices $\{\Theta^1, \ldots, \Theta^M\}$ with cardinality $M = \lfloor \frac{1}{4} \exp\big(\frac{rd}{128}\big) \rfloor$
such that each matrix has rank $r$, and moreover



\\e e \lF 



|0^-0 A 



> 5 



for all 1 = 1,2,..., M, 
for all t^k, 



a sp {@ e ) < a/32 log d for alii = 1,2,..., M, and 



sp 

10 



Ion < — = for all 



1,2, 



M. 



(38a) 
(38b) 
(38c) 

(38d) 
We now show how to use this packing set in our Fano bound. To avoid technical
complications, we assume throughout that $rd \ge 1024 \log 2$. Note that the packing set from Lemma 2
satisfies $|||\Theta^k - \Theta^\ell|||_F \le 2\delta$ for all $k \neq \ell$, and hence from the Fano bound (37), we obtain

$$\mathbb{P}[\widehat{V} \neq V] \ge 1 - \frac{\frac{2 n \delta^2}{\nu^2} + \log 2}{\frac{rd}{128} - \log 4} \ge 1 - \frac{\frac{2 n \delta^2}{\nu^2} + \log 2}{\frac{rd}{256}} = 1 - \frac{512 \frac{n \delta^2}{\nu^2} + 256 \log 2}{rd}.$$

If we now choose $\delta^2 = \frac{\nu^2\, r d}{2048\, n}$, then

$$\mathbb{P}[\widehat{V} \neq V] \ge 1 - \frac{\frac{rd}{4} + 256 \log 2}{rd} \ge \frac{1}{2},$$

where the final inequality again uses the bound $rd \ge 1024 \log 2$.
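The arithmetic in this step is easy to check numerically. The sketch below evaluates the first expression in the chain above, with $\delta^2 = \nu^2 r d / (2048\, n)$ substituted in, and confirms that it is at least $1/2$ once $rd \ge 1024 \log 2$. The function name and the test values of $r$, $d$, $n$ and $\nu$ are illustrative assumptions.

```python
import math

def fano_lower_bound(r, d, n, nu):
    """Evaluate 1 - (2 n delta^2 / nu^2 + log 2) / (r d / 128 - log 4)
    at the choice delta^2 = nu^2 r d / (2048 n)."""
    delta_sq = nu ** 2 * r * d / (2048.0 * n)
    numerator = 2.0 * n * delta_sq / nu ** 2 + math.log(2)
    log_M_lower = r * d / 128.0 - math.log(4)   # lower bound on log M used in the text
    return 1.0 - numerator / log_M_lower

# Smallest integer value of r * d allowed by the assumption r d >= 1024 log 2.
rd_min = math.ceil(1024 * math.log(2))
print(rd_min, fano_lower_bound(1, rd_min, 1000, 1.0))
```

Note that with this choice of $\delta^2$ the numerator depends only on the product $rd$, so the resulting error probability bound does not depend on $n$ or $\nu$.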

In the special case $q = 0$, the proof is complete, since the elements $\Theta^\ell$ all have rank $r = \rho_0$,
and satisfy the bound $\alpha_{sp}(\Theta^\ell) \le \sqrt{32 \log d}$. For $q \in (0, 1]$, consider the matrix class $\mathbb{B}(\rho_q)$,

q 

and let us set r = min{d, \p q (~) in Lemma [21 With this choice, since each matrix Q £ 
has rank r, we have 

so that we are guaranteed that 0^ £ B(/3 g ). Finally, we note that 



rd . r /(A 1- * d 2 , 
— > mm{p q - , — ), 
n \n / n 



so that we conclude that the minimax error is lower bounded by 

2j\ 1-f „2 J2 



i . r /vdv - * ^ 



mm p 9 



4096 I V n / n 

for dr sufficiently large. (At the expense of a worse pre-factor, the same bound holds for all 
d > 10.) 



5 Proof of Theorem 1

We now turn to the proof that the sampling operator in weighted matrix completion satisfies
restricted strong convexity over the set $\mathfrak{C}$, as stated in Theorem 1. In order to lighten notation,
we prove the theorem in the case $d_r = d_c$. In terms of rates, this is a worst-case assumption,
effectively amounting to replacing both $d_r$ and $d_c$ by the worst-case $\max\{d_r, d_c\}$. However,
since our rates are driven by $d = \frac{1}{2}(d_r + d_c)$ and we have the inequalities

$$\frac{1}{2} \max\{d_r, d_c\} \le \frac{1}{2}(d_r + d_c) \le \max\{d_r, d_c\},$$

this change has only an effect on the constant factors. The proof can be extended to the
general setting $d_r \neq d_c$ by appropriate modifications if these constant factors are of interest.
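5.1 Reduction to simpler events

In order to prove Theorem 1, it is equivalent to show that, with high probability, we have

$$\frac{\|\mathfrak{X}_n'(\Gamma)\|_2}{\sqrt{n}} \ge \frac{1}{8}\,|||\Gamma|||_F - \frac{48 L d\, \|\Gamma\|_\infty}{\sqrt{n}} \quad \text{for all } \Gamma \in \mathfrak{C}'(n; c_0). \tag{39}$$

The remainder of the proof is devoted to studying the "bad" event

$$\mathcal{E}(\mathfrak{X}_n') := \Big\{ \exists\, \Gamma \in \mathfrak{C}'(n; c_0) \;\Big|\; \Big| \frac{\|\mathfrak{X}_n'(\Gamma)\|_2}{\sqrt{n}} - |||\Gamma|||_F \Big| > \frac{7}{8}\,|||\Gamma|||_F + \frac{48 L d\, \|\Gamma\|_\infty}{\sqrt{n}} \Big\}. \tag{40}$$

Suppose that $\mathcal{E}(\mathfrak{X}_n')$ does not hold: then we have

$$\Big| \frac{\|\mathfrak{X}_n'(\Gamma)\|_2}{\sqrt{n}} - |||\Gamma|||_F \Big| \le \frac{7}{8}\,|||\Gamma|||_F + \frac{48 L d\, \|\Gamma\|_\infty}{\sqrt{n}} \quad \text{for all } \Gamma \in \mathfrak{C}'(n; c_0),$$

which implies that the bound (39) holds. Consequently, in terms of the "bad" event, the claim
of Theorem 1 is implied by the tail bound $\mathbb{P}[\mathcal{E}(\mathfrak{X}_n')] \le 16 \exp(-c'\, d \log d)$.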

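We now show that in order to establish a tail bound on $\mathcal{E}(\mathfrak{X}_n')$, it suffices to bound
the probability of some simpler events $\mathcal{E}(\mathfrak{X}_n'; D)$, defined below. Since the definition of the
set $\mathfrak{C}'(n; c_0)$ and event $\mathcal{E}(\mathfrak{X}_n')$ is invariant to rescaling of $\Gamma$, we may assume without loss of
generality that $\|\Gamma\|_\infty = \frac{1}{d}$. The remaining degrees of freedom in the set $\mathfrak{C}'(n; c_0)$ can be
parameterized in terms of the quantities $D = |||\Gamma|||_F$ and $\rho = |||\Gamma|||_1$. For any $\Gamma \in \mathfrak{C}'(n; c_0)$ with
$\|\Gamma\|_\infty = \frac{1}{d}$ and $|||\Gamma|||_F \le D$, we have $|||\Gamma|||_1 \le \rho(D)$, where

$$\rho(D) := \frac{D^2}{c_0 \sqrt{\frac{d \log d}{n}}}. \tag{41}$$

For each radius $D > 0$, consider the set

$$\mathfrak{B}(D) := \Big\{ \Gamma \in \mathfrak{C}'(n; c_0) \;\Big|\; \|\Gamma\|_\infty = \frac{1}{d},\; |||\Gamma|||_F \le D,\; |||\Gamma|||_1 \le \rho(D) \Big\}, \tag{42}$$

and the associated event

$$\mathcal{E}(\mathfrak{X}_n'; D) := \Big\{ \exists\, \Gamma \in \mathfrak{B}(D) \;\Big|\; \Big| \frac{\|\mathfrak{X}_n'(\Gamma)\|_2}{\sqrt{n}} - |||\Gamma|||_F \Big| \ge \frac{3}{4} D + \frac{48 L}{\sqrt{n}} \Big\}. \tag{43}$$

The following lemma shows that it suffices to upper bound the probability of the event
$\mathcal{E}(\mathfrak{X}_n'; D)$ for each fixed $D > 0$.

Lemma 3. Suppose that there are universal constants $(c_1, c_2)$ such that

$$\mathbb{P}[\mathcal{E}(\mathfrak{X}_n'; D)] \le c_1 \exp(-c_2\, n D^2) \tag{44}$$

for each fixed $D > 0$. Then there is a universal constant $c_2'$ such that

$$\mathbb{P}[\mathcal{E}(\mathfrak{X}_n')] \le c_1 \frac{\exp(-c_2'\, d \log d)}{1 - \exp(-c_2'\, d \log d)}. \tag{45}$$

The proof of this claim, provided in Appendix B, follows by a peeling argument.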






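5.2 Bounding the probability of $\mathcal{E}(\mathfrak{X}_n'; D)$

Based on Lemma 3, it suffices to prove the tail bound (44) on the event $\mathcal{E}(\mathfrak{X}_n'; D)$ for each
fixed $D > 0$. Let us define

$$Z_n(D) := \sup_{\Gamma \in \overline{\mathfrak{B}}(D)} \Big| \frac{\|\mathfrak{X}_n'(\Gamma)\|_2}{\sqrt{n}} - |||\Gamma|||_F \Big|, \tag{46}$$

where

$$\overline{\mathfrak{B}}(D) := \Big\{ \Gamma \in \mathfrak{C}'(n; c_0) \;\Big|\; \|\Gamma\|_\infty \le \frac{1}{d},\; |||\Gamma|||_F \le D,\; |||\Gamma|||_1 \le \rho(D) \Big\}. \tag{47}$$

(The only difference from $\mathfrak{B}(D)$ is that we have relaxed to the inequality $\|\Gamma\|_\infty \le \frac{1}{d}$.) In the
remainder of this section, we prove that there are universal constants $(c_1, c_2)$ such that

$$\mathbb{P}\Big[ Z_n(D) \ge \frac{3}{4} D + \frac{48 L}{\sqrt{n}} \Big] \le c_1 \exp\Big( -c_2 \frac{n D^2}{L^2} \Big) \quad \text{for each fixed } D > 0. \tag{48}$$

This tail bound means that the condition of Lemma 3 is satisfied, and so completes the proof
of Theorem 1.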

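In order to prove (48), we begin with a discretization argument. Let $\Gamma^1, \ldots, \Gamma^{N(\delta)}$ be
a $\delta$-covering of $\overline{\mathfrak{B}}(D)$ in the Frobenius norm. By definition, given an arbitrary $\Gamma \in \overline{\mathfrak{B}}(D)$,
there exists some index $k \in \{1, \ldots, N(\delta)\}$ and a matrix $\Delta \in \mathbb{R}^{d \times d}$ with $|||\Delta|||_F \le \delta$ such that
$\Gamma = \Gamma^k + \Delta$. Therefore, we have

$$\frac{\|\mathfrak{X}_n'(\Gamma)\|_2}{\sqrt{n}} - |||\Gamma|||_F = \frac{\|\mathfrak{X}_n'(\Gamma^k + \Delta)\|_2}{\sqrt{n}} - |||\Gamma^k + \Delta|||_F$$
$$\le \frac{\|\mathfrak{X}_n'(\Gamma^k)\|_2}{\sqrt{n}} + \frac{\|\mathfrak{X}_n'(\Delta)\|_2}{\sqrt{n}} - |||\Gamma^k|||_F + |||\Delta|||_F$$
$$\le \Big| \frac{\|\mathfrak{X}_n'(\Gamma^k)\|_2}{\sqrt{n}} - |||\Gamma^k|||_F \Big| + \frac{\|\mathfrak{X}_n'(\Delta)\|_2}{\sqrt{n}} + \delta,$$

where we have used the triangle inequality. Following the same steps establishes that this
inequality holds for the absolute value of the difference.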

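Moreover, since $\Delta = \Gamma - \Gamma^k$ with both $\Gamma^k$ and $\Gamma$ belonging to $\overline{\mathfrak{B}}(D)$, we have $|||\Delta|||_1 \le 2\rho(D)$
and $\|\Delta\|_\infty \le \frac{2}{d}$, where we have used the definition (42). Putting together the pieces, we
conclude that

$$Z_n(D) \le \delta + \max_{k = 1, \ldots, N(\delta)} \Big| \frac{\|\mathfrak{X}_n'(\Gamma^k)\|_2}{\sqrt{n}} - |||\Gamma^k|||_F \Big| + \sup_{\Delta \in \mathfrak{D}(\delta, R)} \frac{\|\mathfrak{X}_n'(\Delta)\|_2}{\sqrt{n}}, \tag{49}$$

where

$$\mathfrak{D}(\delta, R) := \Big\{ \Delta \in \mathbb{R}^{d \times d} \;\Big|\; |||\Delta|||_F \le \delta,\; |||\Delta|||_1 \le 2\rho(D),\; \|\Delta\|_\infty \le \frac{2}{d} \Big\}. \tag{50}$$

Note that the bound (49) holds for any choice of $\delta > 0$. We establish the tail bound (48)
with the choice $\delta = D/8$, and using the following two lemmas. The first lemma provides
control of the maximum over the covering set: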
Lemma 4. As long as $d \ge 10$, we have

$$\max_{k = 1, \ldots, N(D/8)} \Big| \frac{\|\mathfrak{X}_n'(\Gamma^k)\|_2}{\sqrt{n}} - |||\Gamma^k|||_F \Big| \le \frac{D}{8} + \frac{48 L}{\sqrt{n}} \tag{51}$$

with probability greater than $1 - c \exp\Big( -\frac{n D^2}{512 L^2} \Big)$.

See Appendix C for the proof of this claim.

Our second lemma, proved in Appendix D, provides control over the final term in the
upper bound (49).



Lemma 5. 

\Xn'(A)\\ 2 ,D 



sup 

AeD(f ,R) 



with probability at least 1 — 2 exp ( — s ^ L -2 



< — 

n 1 ~ 2 



Combining these two lemmas with the upper bound (49) with $\delta = D/8$, we obtain

$$Z_n(D) \le \frac{D}{8} + \frac{D}{8} + \frac{48 L}{\sqrt{n}} + \frac{D}{2} = \frac{3 D}{4} + \frac{48 L}{\sqrt{n}}$$

with probability at least $1 - c_1 \exp(-c_2\, n D^2 / L^2)$ for universal constants $(c_1, c_2)$, thereby
establishing the tail bound (48) and completing the proof of Theorem 1.



6 Discussion 

In this paper, we have established error bounds for the problem of weighted matrix completion 
based on partial and noisy observations. We proved both a general result, one which applies 
to any matrix, and showed how it yields corollaries for both the cases of exactly low-rank 
and approximately low-rank matrices. A key technical result is establishing that the matrix 
sampling operator satisfies a suitable form of restricted strong convexity [18] over a set of 
matrices with controlled rank and spikiness. Since more restrictive properties such as RIP do 
not hold for matrix completion, this RSC ingredient is essential to our analysis. Our proof of 
the RSC condition relied on a number of techniques from empirical process and random matrix 
theory, including concentration of measure, contraction inequalities and the Ahlswede-Winter
bound. Using information-theoretic methods, we also proved that up to logarithmic factors, 
our error bounds cannot be improved upon by any algorithm, showing that our method is 
essentially minimax-optimal. 

There are various open questions that remain to be studied. Although our analysis applies 
to both uniform and non-uniform sampling models, it is limited to the case where each row (or 
column) is sampled with a certain probability. It would be interesting to consider extensions 
to settings in which the sampling probability differed from entry to entry, as investigated 
empirically by Salakhutdinov and Srebro [25].
A Proof of Lemma 2

We proceed via the probabilistic method, in particular by showing that a random procedure
succeeds in generating such a set with probability at least 1/2. Let $M' = \exp\big(\frac{rd}{128}\big)$, and for
each $\ell = 1, \ldots, M'$, we draw a random matrix $\widetilde{\Theta}^\ell \in \mathbb{R}^{d \times d}$ according to the following procedure:

(a) For rows $i = 1, \ldots, r$ and for each column $j = 1, \ldots, d$, choose each $\widetilde{\Theta}^\ell_{ij} \in \{-1, +1\}$
uniformly at random, independently across $(i, j)$.

(b) For rows $i = r + 1, \ldots, d$, set $\widetilde{\Theta}^\ell_{ij} = 0$.

We then let $Q \in \mathbb{R}^{d \times d}$ be a random unitary matrix, and define $\Theta^\ell = \frac{\delta}{\sqrt{rd}}\, Q \widetilde{\Theta}^\ell$ for all
$\ell = 1, \ldots, M'$. The remainder of the proof analyzes the random set $\{\Theta^1, \ldots, \Theta^{M'}\}$, and shows
that it contains a subset of size at least $M = M'/4$ that has properties (a) through (d) with
probability at least 1/2.

By construction, each matrix $\widetilde{\Theta}^\ell$ has rank at most $r$, and Frobenius norm $|||\widetilde{\Theta}^\ell|||_F = \sqrt{rd}$.
Since $Q$ is unitary, the rescaled matrices $\Theta^\ell$ have Frobenius norm $|||\Theta^\ell|||_F = \delta$. We now prove
that

$$|||\Theta^\ell - \Theta^k|||_F \ge \delta \quad \text{for all } \ell \neq k$$

with probability at least 7/8. Again, since $Q$ is unitary, it suffices to show that $|||\widetilde{\Theta}^\ell - \widetilde{\Theta}^k|||_F \ge \sqrt{rd}$
for any pair $\ell \neq k$. We have

$$\frac{1}{rd}\, |||\widetilde{\Theta}^\ell - \widetilde{\Theta}^k|||_F^2 = \frac{1}{rd} \sum_{i=1}^{r} \sum_{j=1}^{d} \big( \widetilde{\Theta}^\ell_{ij} - \widetilde{\Theta}^k_{ij} \big)^2.$$

This is a sum of $rd$ i.i.d. variables, each bounded by 4. The mean of the sum is 2, so that the
Hoeffding bound implies that

$$\mathbb{P}\Big[ \frac{1}{rd}\, |||\widetilde{\Theta}^k - \widetilde{\Theta}^\ell|||_F^2 \le 2 - t \Big] \le 2 \exp(-rd\, t^2 / 32).$$

Since there are less than $(M')^2$ pairs of matrices in total, setting $t = 1$ yields

$$\mathbb{P}\Big[ \min_{\ell, k = 1, \ldots, M'} \frac{|||\widetilde{\Theta}^\ell - \widetilde{\Theta}^k|||_F^2}{rd} \ge 1 \Big] \ge 1 - 2 \exp\Big( -\frac{rd}{32} + 2 \log M' \Big) \ge \frac{7}{8},$$

where we have used the facts $\log M' = \frac{rd}{128}$ and $d \ge 10$. Recalling the definition of $\Theta^\ell$, we
conclude that

$$\mathbb{P}\Big[ \min_{\ell, k = 1, \ldots, M'} |||\Theta^\ell - \Theta^k|||_F^2 \ge \delta^2 \Big] \ge \frac{7}{8}. \tag{52}$$
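The separation property can be illustrated with a small simulation of the unscaled sign matrices. The sketch below uses sizes far smaller than those in the lemma (the values of $r$, $d$ and the number of matrices are assumptions chosen so that the code runs quickly), draws the nonzero $r \times d$ blocks, and computes the normalized squared distances for all pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d, num_matrices = 4, 50, 30                # small illustrative sizes (assumed)

# Nonzero blocks of the unscaled matrices: r x d arrays of independent random signs.
signs = rng.choice([-1.0, 1.0], size=(num_matrices, r, d))

# Normalized squared Frobenius distance for every pair of distinct matrices.
normalized_sq_dists = [
    np.sum((signs[k] - signs[l]) ** 2) / (r * d)
    for k in range(num_matrices)
    for l in range(k + 1, num_matrices)
]
min_dist = min(normalized_sq_dists)
mean_dist = float(np.mean(normalized_sq_dists))

print(min_dist, mean_dist)
```

Each squared entry difference is either 0 or 4 with equal probability, so the normalized distances concentrate around 2, and even the minimum over all pairs stays above the level 1 required by the argument.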

We now establish bounds on a sp (0^) and |||0^|||2- We first prove that for any fixed index 
£ G {1,2,..., M'}, our construction satisfies 

a sp (0^) < V321ogd] > j. (53) 
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Indeed, for any pair of indices we have |0|-| = \(qi, Vj}\, where qi G M rf is drawn from 

the uniform distribution over the d-dimensional sphere, and ||wj||2 = \fr ^5= = ■ By 

Levy's theorem for concentration on the sphere |14j . we have 



V[\{qi,Vj)\ >t] <2exp( 



f / 2 



Setting t = s/d and taking the union bound over all d 2 indices, we obtain 

1 



*[d ||6 Hoc >s]< 2exp 



■|e*ii 



-s z + 2\ogd 



This probability is less than 1/2 for s = ||| If V32 log d and d > 2, which establishes the 
intermediate claim ([53]) . 

Finally, we turn to property (d). For each fixed £, by definition of Q e and the unitary 
nature of Q, we have |||0 £ |||op = "^r^l^Uki where U £ {—1, +l} rxd is a random matrix with i.i.d. 
Rademacher (and hence sub-Gaussian) entries. Known results on sub-Gaussian matrices f?| 
yield 

*\lU\j 2 < ^=(V^ + Vd)] >l-2exp(-i(^ + v / d) 2 ) > - 
ird vrd J 4 4 



for ci > 10. Since r < d, we conclude that 



ie% < « 



~ 4 



(54) 



By combining the bounds (|53p and (|54p . we find that for each fixed £ = 1, . . . , M' , we have 

1 



|e<|, < " 2^> < 



> 



(55) 



Consider the event £ that there exists a subset S 1 C {1, . . . , M'} of cardinality M = \M' such 
that 

|||e £ ||| 2 < 4a/?, and ^£M<y^Io^ for all £ E S. 
V n 0F 



By the bound (|5o| . we have 



> 



fc=M 



Since we have chosen M < M'/2, we are guaranteed that P[£] > 1/2, thereby completing the 
proof. 



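B Proof of Lemma 3

We first observe that for any $\Gamma \in \mathfrak{C}'(n; c_0)$ with $\|\Gamma\|_\infty = \frac{1}{d}$, we have

$$|||\Gamma|||_F^2 \ge c_0\, |||\Gamma|||_1 \sqrt{\frac{d \log d}{n}} \ge c_0\, |||\Gamma|||_F \sqrt{\frac{d \log d}{n}},$$

whence $|||\Gamma|||_F \ge c_0 \sqrt{\frac{d \log d}{n}}$. Accordingly, recalling the definition (42), it suffices to restrict our
attention to sets $\mathfrak{B}(D)$ with $D \ge \mu := c_0 \sqrt{\frac{d \log d}{n}}$. For $\ell = 1, 2, \ldots$ and $\alpha = \frac{7}{6}$, define the sets

$$\mathbb{S}_\ell := \Big\{ \Gamma \in \mathfrak{C}'(n; c_0) \;\Big|\; \|\Gamma\|_\infty = \frac{1}{d},\; \alpha^{\ell - 1} \mu \le |||\Gamma|||_F \le \alpha^\ell \mu,\; \text{and}\; |||\Gamma|||_1 \le \rho(\alpha^\ell \mu) \Big\}. \tag{56}$$

From the definition (42), note that by construction, we have $\mathbb{S}_\ell \subseteq \mathfrak{B}(\alpha^\ell \mu)$.

Now if the event $\mathcal{E}(\mathfrak{X}_n')$ holds for some matrix $\Gamma$, then this matrix $\Gamma$ must belong to some
set $\mathbb{S}_\ell$. When $\Gamma \in \mathbb{S}_\ell$, then we are guaranteed the existence of a matrix $\Gamma \in \mathfrak{B}(\alpha^\ell \mu)$ such that

$$\Big| \frac{\|\mathfrak{X}_n'(\Gamma)\|_2}{\sqrt{n}} - |||\Gamma|||_F \Big| > \frac{7}{8}\,|||\Gamma|||_F + \frac{48 L}{\sqrt{n}} \ge \frac{7}{8}\, \alpha^{\ell - 1} \mu + \frac{48 L}{\sqrt{n}} = \frac{3}{4}\, \alpha^\ell \mu + \frac{48 L}{\sqrt{n}},$$

where the final equality follows since $\alpha = 7/6$. Thus, we have shown that when the violating
matrix $\Gamma \in \mathbb{S}_\ell$, then event $\mathcal{E}(\mathfrak{X}_n'; \alpha^\ell \mu)$ must hold. Since any violating matrix must fall into
some set $\mathbb{S}_\ell$, the union bound implies that

$$\mathbb{P}[\mathcal{E}(\mathfrak{X}_n')] \le \sum_{\ell = 1}^{\infty} \mathbb{P}[\mathcal{E}(\mathfrak{X}_n'; \alpha^\ell \mu)] \le c_1 \sum_{\ell = 1}^{\infty} \exp\big( -c_2\, n\, \alpha^{2\ell} \mu^2 \big)$$
$$\le c_1 \sum_{\ell = 1}^{\infty} \exp\big( -2 c_2 \log(\alpha)\, \ell\, n \mu^2 \big) \le c_1 \frac{\exp(-c_2'\, n \mu^2)}{1 - \exp(-c_2'\, n \mu^2)}.$$

Since $n \mu^2 = \Omega(d \log d)$, the claim follows.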
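The last two steps of the peeling argument use the elementary bound $\alpha^{2\ell} \ge 2 \ell \log \alpha$ followed by a geometric series with $c_2' = 2 c_2 \log \alpha$. The sketch below compares the two sides numerically; the function name, the truncation of the infinite sum, and the test values of $c_2$ and $n\mu^2$ are assumptions made for illustration.

```python
import math

def peeling_sums(c2, n_mu_sq, alpha=7.0 / 6.0, terms=200):
    """Return (truncated sum of exp(-c2 * n mu^2 * alpha^(2l)), geometric upper bound)."""
    c2_prime = 2.0 * c2 * math.log(alpha)     # constant obtained from alpha^(2l) >= 2 l log(alpha)
    direct = sum(
        math.exp(-c2 * n_mu_sq * alpha ** (2 * l)) for l in range(1, terms + 1)
    )
    rate = math.exp(-c2_prime * n_mu_sq)
    geometric = rate / (1.0 - rate)           # closed form of sum_{l >= 1} rate^l
    return direct, geometric

print(peeling_sums(1.0, 10.0))
print(peeling_sums(0.5, 1.0))
```

The direct sum is much smaller than the geometric bound in these instances; the argument only needs the bound to decay exponentially in $n\mu^2$, which the closed form makes explicit.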

C Proof of Lemma 4

For a fixed matrix $\Gamma$, define the function $F_\Gamma(\mathfrak{X}_n') = \frac{1}{\sqrt{n}} \|\mathfrak{X}_n'(\Gamma)\|_2$. We prove the lemma in
two parts: first, we establish that for any fixed $\Gamma$, the function $F_\Gamma$ satisfies the tail bound

$$\mathbb{P}\Big[ \big| F_\Gamma(\mathfrak{X}_n') - |||\Gamma|||_F \big| \ge \delta + \frac{48 L}{\sqrt{n}} \Big] \le 4 \exp\Big( -\frac{n \delta^2}{4 L^2} \Big). \tag{57}$$

We then show that there exists a $\delta$-covering of $\overline{\mathfrak{B}}(D)$ such that

$$\log N(\delta) \le 36\, (\rho(D)/\delta)^2\, d. \tag{58}$$

Combining the tail bound (57) with the union bound, we obtain

$$\mathbb{P}\Big[ \max_{k = 1, \ldots, N(\delta)} \big| F_{\Gamma^k}(\mathfrak{X}_n') - |||\Gamma^k|||_F \big| \ge \delta + \frac{48 L}{\sqrt{n}} \Big] \le 4 \exp\Big( -\frac{n \delta^2}{4 L^2} + \log N(\delta) \Big)$$
$$\le 4 \exp\Big\{ -\frac{n \delta^2}{4 L^2} + 36\, (\rho(D)/\delta)^2\, d \Big\},$$

where the final inequality uses the bound (58). Since Lemma 4 is based on the choice
$\delta = D/8$, it suffices to show that

$$\frac{n D^2}{512 L^2} \ge 36\, \big( \rho(D)/(D/8) \big)^2 d = \frac{2304\, D^2 n}{c_0^2 \log d}.$$

Noting that the terms involving $D^2$ and $n$ both cancel out, we see that for any fixed $c_0$, this
inequality holds once $\log d$ is sufficiently large. By choosing $c_0$ sufficiently large, we can ensure
that it holds for all $d \ge 2$.

It remains to establish the two intermediate claims (57) and (58).



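Upper bounding the covering number (58): We start by proving the upper bound (58)
on the covering number. To begin, let $\widetilde{N}(\delta)$ denote the $\delta$-covering number (in Frobenius
norm) of the nuclear norm ball $\mathbb{B}_1(\rho(D)) = \{ \Delta \in \mathbb{R}^{d \times d} \mid |||\Delta|||_1 \le \rho(D) \}$, and let $N(\delta)$ be
the covering number of the set $\overline{\mathfrak{B}}(D)$. We first claim that $N(\delta) \le \widetilde{N}(\delta)$. Let $\{\Gamma^1, \ldots, \Gamma^{\widetilde{N}(\delta)}\}$
be a $\delta$-cover of $\mathbb{B}_1(\rho(D))$. From equation (47), note that the set $\overline{\mathfrak{B}}(D)$ is contained within
$\mathbb{B}_1(\rho(D))$; in particular, it is obtained by intersecting the latter set with the set

$$\mathbb{S} := \Big\{ \Delta \in \mathbb{R}^{d \times d} \;\Big|\; \|\Delta\|_\infty \le \frac{1}{d},\; |||\Delta|||_F \le D \Big\}.$$

Letting $\Pi_{\mathbb{S}}$ denote the projection operator under Frobenius norm onto this set, we claim that
$\{\Pi_{\mathbb{S}}(\Gamma^j),\, j = 1, \ldots, \widetilde{N}(\delta)\}$ is a $\delta$-cover of $\overline{\mathfrak{B}}(D)$. Indeed, since $\mathbb{S}$ is non-empty, closed and
convex, the projection operator is non-expansive [3], and thus for any $\Gamma \in \overline{\mathfrak{B}}(D) \subseteq \mathbb{S}$, we
have

$$|||\Pi_{\mathbb{S}}(\Gamma^j) - \Gamma|||_F = |||\Pi_{\mathbb{S}}(\Gamma^j) - \Pi_{\mathbb{S}}(\Gamma)|||_F \le |||\Gamma^j - \Gamma|||_F,$$

which establishes the claim.

We now upper bound $\widetilde{N}(\delta)$. Let $G \in \mathbb{R}^{d \times d}$ be a random matrix with i.i.d. $N(0, 1)$ entries.
By Sudakov minoration (cf. Theorem 5.6 in Pisier [H]), we have

$$\sqrt{\log \widetilde{N}(\delta)} \le \frac{3}{\delta}\, \mathbb{E}\Big[ \sup_{|||\Delta|||_1 \le \rho(D)} \langle\langle G, \Delta \rangle\rangle \Big] \le \frac{3 \rho(D)}{\delta}\, \mathbb{E}\big[ |||G|||_2 \big],$$

where the second inequality follows from the duality between the nuclear and operator norms.
From known results on the operator norms of Gaussian random matrices [7], we have the upper
bound $\mathbb{E}[|||G|||_2] \le 2\sqrt{d}$, so that

$$\sqrt{\log \widetilde{N}(\delta)} \le \frac{6 \rho(D) \sqrt{d}}{\delta},$$

thereby establishing the bound (58).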
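The bound $\mathbb{E}[|||G|||_2] \le 2\sqrt{d}$ on the operator norm of a square Gaussian matrix can be illustrated by a quick Monte Carlo sketch. The dimension, the number of trials and the random seed below are arbitrary assumptions; the simulation is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 200, 20                           # illustrative values (assumed)

# Largest singular value of d x d matrices with i.i.d. standard normal entries.
op_norms = [np.linalg.norm(rng.standard_normal((d, d)), 2) for _ in range(trials)]
avg_op_norm = float(np.mean(op_norms))
reference = 2.0 * np.sqrt(d)                  # the bound used in the covering argument

print(avg_op_norm, reference)
```

In such simulations the average operator norm typically sits slightly below $2\sqrt{d}$, consistent with the constant 6 = 3 x 2 that appears in the final covering bound.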

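Establishing the tail bound (57): Recalling the definition of the operator $\mathfrak{X}_n'$, we have

$$F_\Gamma(\mathfrak{X}_n') = \frac{1}{\sqrt{n}} \Big\{ \sum_{i=1}^{n} \langle\langle \widetilde{X}^{(i)}, \Gamma \rangle\rangle^2 \Big\}^{1/2} = \frac{1}{\sqrt{n}} \sup_{\|u\|_2 = 1} \sum_{i=1}^{n} u_i \langle\langle \widetilde{X}^{(i)}, \Gamma \rangle\rangle = \frac{1}{\sqrt{n}} \sup_{\|u\|_2 = 1} \sum_{i=1}^{n} u_i Y_i,$$

where we have defined the random variables $Y_i := \langle\langle \widetilde{X}^{(i)}, \Gamma \rangle\rangle$. Note that each $Y_i$ is zero-mean,
and bounded by $2L$ since

$$|Y_i| = |\langle\langle \widetilde{X}^{(i)}, \Gamma \rangle\rangle| \le \Big( \sum_{a, b} |\widetilde{X}^{(i)}|_{ab} \Big) \|\Gamma\|_\infty \le 2L,$$

where we have used the facts that $\|\Gamma\|_\infty \le 2/d$, and $\sum_{a, b} |\widetilde{X}^{(i)}|_{ab} \le L d$, by definition of the
matrices $\widetilde{X}^{(i)}$.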
Therefore, applying Corollary 4.8 from Ledoux [14] . we conclude that 
F[\F T (X n ') - E[Fr(X n ')}\ > 5 + -=] < 4exp ( - 



The same corollary implies that 

E[F^X n ')]-E[F r (X n ')}\ < 

Since E[Fj?(£„')] = lll r lllF' the tail bound §7} follows. 

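D Proof of Lemma 5

From the proof of Lemma 4, recall the definition $F_\Gamma(\mathfrak{X}_n') = \frac{1}{\sqrt{n}} \|\mathfrak{X}_n'(\Gamma)\|_2$, where $\mathfrak{X}_n'$ is the
random sampling operator defined by the $n$ matrices $(\widetilde{X}^{(1)}, \ldots, \widetilde{X}^{(n)})$. Using this notation,
our goal is to bound the function

$$G(\mathfrak{X}_n') := \sup_{\Delta \in \mathfrak{D}(\delta, R)} F_\Delta(\mathfrak{X}_n'),$$

where we recall that $\mathfrak{D}(\delta, R) := \{ \Delta \in \mathbb{R}^{d_r \times d_c} \mid |||\Delta|||_F \le \delta,\; |||\Delta|||_1 \le 2\rho(D),\; \|\Delta\|_\infty \le \frac{2}{d} \}$.
Ultimately, we will set $\delta = \frac{D}{8}$, but we use $\delta$ until the end of the proof for compactness in
notation.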

Our approach is a standard one: first show concentration of G around its expectation 
E[G(X n ')], and then upper bound the expectation. We show concentration via a bounded 
difference inequality; since G is a symmetric function of its arguments, it suffices to establish 
the bounded difference property with respect to the first co-ordinate. In order to do so, 
consider a second operator X n ' defined by the matrices (Z^\ X^ 2 \ . . . differing from 

X n ' only in the first matrix. Given the pair (X n ' ,X n '), we have 

G(X n ') - G(3E?) = sup F A (X n ')- sup F e (xZ') 

AeS(<5,R) eeS(<5,_R) 

< sup [F A (X n ') - F A (xZ')] 

AeD(<5,R) 

< sup -^||£ n '(A)-i/(A)|| 2 
Ae2)(<5,R) V n 



L|((x(i)_ Z (i) ; A) ) I 



= sup 

Aes(<5,/?) V n 

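For any fixed $\Delta \in \mathfrak{D}(\delta,R)$, we have

$$
\big| \langle\!\langle \widetilde{X}^{(1)} - Z^{(1)}, \Delta \rangle\!\rangle \big| \le 2Ld \, \|\Delta\|_\infty \le 4L,
$$

where we have used the fact that the matrix $\widetilde{X}^{(1)} - Z^{(1)}$ is non-zero in at most two entries with values upper bounded by $2Ld$. Combining the pieces yields $G(\mathfrak{X}_n') - G(\widetilde{\mathfrak{X}}_n') \le \frac{4L}{\sqrt{n}}$.

Since the same argument can be applied with the roles of $\mathfrak{X}_n'$ and $\widetilde{\mathfrak{X}}_n'$ interchanged, we conclude that $|G(\mathfrak{X}_n') - G(\widetilde{\mathfrak{X}}_n')| \le \frac{4L}{\sqrt{n}}$. Therefore, by the bounded differences variant of the Azuma-Hoeffding inequality [14], we have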

n t 2 

F[\G(X n ')-E[G(X n ')]\>t] <2ex P (-^). (59) 
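The bounded difference property behind this concentration step can be checked numerically for a single fixed direction: replacing one observation changes the normalized norm by at most the size of that one change divided by $\sqrt{n}$. In the sketch below, the lists `obs` and `swapped` are hypothetical stand-ins for the values of the inner products before and after the first matrix is replaced, and $L = 1.5$ is an illustrative choice, not a value from the paper.

```python
import math
import random

def normalized_norm(values):
    """(1/sqrt(n)) * ||values||_2, the analogue of F for one fixed direction."""
    n = len(values)
    return math.sqrt(sum(v * v for v in values) / n)

rng = random.Random(1)
n, L = 200, 1.5                      # illustrative values (assumptions)

# Stand-ins for the inner products, each bounded by 2L in absolute value.
obs = [rng.uniform(-2 * L, 2 * L) for _ in range(n)]

# Replace only the first coordinate, mimicking the swap of the first matrix.
swapped = list(obs)
swapped[0] = rng.uniform(-2 * L, 2 * L)

change = abs(normalized_norm(obs) - normalized_norm(swapped))

# Reverse triangle inequality: the change is at most |obs[0] - swapped[0]| / sqrt(n),
# which in turn is at most 4L / sqrt(n).
per_coordinate_bound = abs(obs[0] - swapped[0]) / math.sqrt(n)
worst_case_bound = 4 * L / math.sqrt(n)

print(change, per_coordinate_bound, worst_case_bound)
```

Because the inequality holds for every fixed direction, it survives taking the supremum over the set, which is the step used for $G$.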



Next we bound the expectation. First applying Jensen's inequality, we have 
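$$
\begin{aligned}
\big( \mathbb{E}[G(\mathfrak{X}_n')] \big)^2 &\le \mathbb{E}[G^2(\mathfrak{X}_n')] \\
&= \mathbb{E}\Big[ \sup_{\Delta \in \mathfrak{D}(\delta,R)} \frac{1}{n} \sum_{i=1}^n \langle\!\langle \widetilde{X}^{(i)}, \Delta \rangle\!\rangle^2 \Big] \\
&= \mathbb{E}\Big[ \sup_{\Delta \in \mathfrak{D}(\delta,R)} \Big\{ \frac{1}{n} \sum_{i=1}^n \big[ \langle\!\langle \widetilde{X}^{(i)}, \Delta \rangle\!\rangle^2 - \mathbb{E}[\langle\!\langle \widetilde{X}^{(i)}, \Delta \rangle\!\rangle^2] \big] + |\!|\!|\Delta|\!|\!|_F^2 \Big\} \Big] \\
&\le \mathbb{E}\Big[ \sup_{\Delta \in \mathfrak{D}(\delta,R)} \frac{1}{n} \sum_{i=1}^n \big[ \langle\!\langle \widetilde{X}^{(i)}, \Delta \rangle\!\rangle^2 - \mathbb{E}[\langle\!\langle \widetilde{X}^{(i)}, \Delta \rangle\!\rangle^2] \big] \Big] + \delta^2,
\end{aligned}
$$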

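where we have used the fact that $\mathbb{E}[\langle\!\langle \widetilde{X}^{(i)}, \Delta \rangle\!\rangle^2] = |\!|\!|\Delta|\!|\!|_F^2 \le \delta^2$. Now a standard symmetrization argument [15] yields

$$
\mathbb{E}_{\mathfrak{X}_n'}[G^2(\mathfrak{X}_n')] \le 2 \, \mathbb{E}_{\mathfrak{X}_n', \varepsilon}\Big[ \sup_{\Delta \in \mathfrak{D}(\delta,R)} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \langle\!\langle \widetilde{X}^{(i)}, \Delta \rangle\!\rangle^2 \Big] + \delta^2,
$$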
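where $\{\varepsilon_i\}_{i=1}^n$ is an i.i.d. Rademacher sequence. Since $|\langle\!\langle \widetilde{X}^{(i)}, \Delta \rangle\!\rangle| \le 2L$ for all $i$, the Ledoux-Talagrand contraction inequality (p. 112, Ledoux and Talagrand [15]) implies that

$$
\mathbb{E}[G^2(\mathfrak{X}_n')] \le 16 L \, \mathbb{E}\Big[ \sup_{\Delta \in \mathfrak{D}(\delta,R)} \Big\{ \frac{1}{n} \sum_{i=1}^n \varepsilon_i \langle\!\langle \widetilde{X}^{(i)}, \Delta \rangle\!\rangle \Big\} \Big] + \delta^2.
$$
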
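By the duality between operator and nuclear norms, we have

$$
\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \langle\!\langle \widetilde{X}^{(i)}, \Delta \rangle\!\rangle \Big| \le \Big|\!\Big|\!\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \widetilde{X}^{(i)} \Big|\!\Big|\!\Big|_2 \, |\!|\!|\Delta|\!|\!|_1,
$$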

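and hence, since $|\!|\!|\Delta|\!|\!|_1 \le \rho(D)$ for all $\Delta \in \mathfrak{D}(\delta,R)$, we have

$$
\mathbb{E}[G^2(\mathfrak{X}_n')] \le 16 L \rho(D) \, \mathbb{E}\Big[ \Big|\!\Big|\!\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \widetilde{X}^{(i)} \Big|\!\Big|\!\Big|_2 \Big] + \delta^2. \tag{60}
$$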
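The duality step uses the inequality $|\langle\!\langle A, B \rangle\!\rangle| \le |\!|\!|A|\!|\!|_2 \, |\!|\!|B|\!|\!|_1$ between the operator and nuclear norms. A minimal Python sketch of this inequality follows, restricted to $2 \times 2$ matrices so that singular values have a closed form and no linear algebra library is needed. The restriction to $2 \times 2$ matrices and all function names are assumptions made for illustration only.

```python
import math
import random

def frob_inner(A, B):
    """Trace inner product <<A, B>> = sum over entries of A_ab * B_ab."""
    return sum(A[r][c] * B[r][c] for r in range(2) for c in range(2))

def singular_values_2x2(M):
    """Closed-form singular values (s1 >= s2 >= 0) of a 2x2 matrix."""
    (a, b), (c, d) = M
    frob_sq = a * a + b * b + c * c + d * d               # s1^2 + s2^2
    det_abs = abs(a * d - b * c)                          # s1 * s2
    plus = math.sqrt(frob_sq + 2 * det_abs)               # s1 + s2
    minus = math.sqrt(max(frob_sq - 2 * det_abs, 0.0))    # s1 - s2
    return (plus + minus) / 2, (plus - minus) / 2

def operator_norm(M):
    """Largest singular value."""
    return singular_values_2x2(M)[0]

def nuclear_norm(M):
    """Sum of singular values."""
    return sum(singular_values_2x2(M))

rng = random.Random(2)

def random_matrix():
    return [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]

# Smallest observed gap between the two sides over random pairs (A, B).
min_slack = min(
    operator_norm(A) * nuclear_norm(B) - abs(frob_inner(A, B))
    for A, B in ((random_matrix(), random_matrix()) for _ in range(1000))
)

# The inequality is tight, e.g. for A = B = identity: 2 = 1 * 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
tight_gap = operator_norm(identity) * nuclear_norm(identity) - frob_inner(identity, identity)

print(min_slack, tight_gap)
```

In the proof, the role of $A$ is played by the averaged signed sampling matrix and the role of $B$ by $\Delta$, so the supremum over the set reduces to a single operator norm.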



It remains to bound the operator norm $\mathbb{E}\big[ |\!|\!| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \widetilde{X}^{(i)} |\!|\!|_2 \big]$. The following lemma, proved in Appendix E, provides a suitable upper bound:

Lemma 6. We have the upper bound 



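$$
\mathbb{E}\Big[ \Big|\!\Big|\!\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \widetilde{X}^{(i)} \Big|\!\Big|\!\Big|_2 \Big] \le 10 \max\Big\{ \sqrt{\frac{L d \log d}{n}}, \; \frac{L d \log d}{n} \Big\}. \tag{61}
$$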
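To get a feel for the scale of the bound in Lemma 6, the following Python sketch simulates the averaged signed sampling matrix in the simplest setting and compares its operator norm with the right-hand side of (61). It assumes uniform sampling (so $L = 1$), square $d \times d$ matrices, and $\widetilde{X}^{(i)} = d \, e_{j(i)} e_{k(i)}^T$; these choices, the values of $d$ and $n$, and the power-iteration helper are illustrative assumptions, not part of the paper.

```python
import math
import random

def operator_norm(A, iters=100):
    """Power iteration on A^T A. Returns ||A v||_2 for the final unit vector v,
    which never exceeds the true operator norm."""
    rows, cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        Av = [sum(A[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        w = [sum(A[r][c] * Av[r] for r in range(rows)) for c in range(cols)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0
        v = [x / norm_w for x in w]
    Av = [sum(A[r][c] * v[c] for c in range(cols)) for r in range(rows)]
    return math.sqrt(sum(x * x for x in Av))

def averaged_sign_matrix(d, n, rng):
    """(1/n) * sum_i eps_i * d * e_{j(i)} e_{k(i)}^T with (j, k) uniform."""
    A = [[0.0] * d for _ in range(d)]
    for _ in range(n):
        j, k = rng.randrange(d), rng.randrange(d)
        eps = rng.choice((-1.0, 1.0))     # Rademacher sign
        A[j][k] += eps * d / n
    return A

def lemma6_bound(d, n, L=1.0):
    """Right-hand side of (61)."""
    x = L * d * math.log(d) / n
    return 10.0 * max(math.sqrt(x), x)

rng = random.Random(0)
d, n, trials = 8, 2000, 20                # illustrative sizes (assumptions)
norms = [operator_norm(averaged_sign_matrix(d, n, rng)) for _ in range(trials)]
mean_norm = sum(norms) / trials           # Monte Carlo estimate of the expectation
bound = lemma6_bound(d, n)

print(mean_norm, bound)
```

A simulation of this kind only illustrates the order of magnitude; it does not verify the constant 10, and power iteration estimates the norm from below.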

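Thus, as long as $n = \Omega(d \log d)$, combined with the earlier bound (60), we conclude that

$$
\mathbb{E}[G(\mathfrak{X}_n')] \le \sqrt{\mathbb{E}[G^2(\mathfrak{X}_n')]} \le \Big[ 160 \, L^2 \rho(D) \sqrt{\frac{d \log d}{n}} + \delta^2 \Big]^{1/2},
$$

using the fact that $L \ge 1$. By definition of $\rho(D)$, we have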



PV y V n c ~ V 16 ; ' 

where the final inequality can be guaranteed by choosing cq sufficiently large. 

Consequently, recalling our choice $\delta = D/8$ and using the inequality $\sqrt{a^2 + b^2} \le |a| + |b|$, we obtain

$$
\mathbb{E}[G(\mathfrak{X}_n')] \le \frac{1}{16} D + \frac{D}{8} = \frac{3}{16} D.
$$

Finally, setting t = in the concentration bound (59) yields

v n 1 - 16 16 2 
with probability at least 1 — 2exp ( — d ^jjr) as claimed. 
E Proof of Lemma 6

We prove this lemma by applying a form of the Ahlswede-Winter matrix bound [1], as stated in Appendix F, to the matrix $Y^{(i)} := \varepsilon_i \widetilde{X}^{(i)}$. We first compute the quantities involved in Lemma 7. Note that $Y^{(i)}$ is a zero-mean random matrix, and satisfies the bound

$$
|\!|\!|Y^{(i)}|\!|\!|_2 = \frac{d}{\sqrt{R_{j(i)} C_{k(i)}}} \, |\!|\!| \varepsilon_i \, e_{j(i)} e_{k(i)}^T |\!|\!|_2 \le L d.
$$

Let us now compute the quantities $\sigma_i$ in Lemma 7. We have



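$$
\mathbb{E}[(Y^{(i)})^T Y^{(i)}] = \mathbb{E}\Big[ \frac{d^2}{R_{j(i)} C_{k(i)}} \, e_{k(i)} e_{k(i)}^T \Big] = d \, I_{d \times d},
$$

and similarly, $\mathbb{E}[Y^{(i)} (Y^{(i)})^T] = d \, I_{d \times d}$, so that

$$
\sigma_i^2 = \max\big\{ |\!|\!| \mathbb{E}[Y^{(i)} (Y^{(i)})^T] |\!|\!|_2, \; |\!|\!| \mathbb{E}[(Y^{(i)})^T Y^{(i)}] |\!|\!|_2 \big\} = d.
$$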

Thus, applying Lemma 7 yields the tail bound

n t 2 t 

P[|||5>X«||| 2 >t] <2dmax{exp(-— ), exp(-— )}. 

i=i 

Setting t = nd, we obtain 

P [|-E e ^ W l2 > 2^5] < 2d m ax{exp(-^-),exp(-^)}. 

i=l 

Recall that for any non-negative random variable $T$, we have $\mathbb{E}[T] = \int_0^\infty \mathbb{P}[T \ge s] \, ds$. Applying this fact to $T := |\!|\!| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \widetilde{X}^{(i)} |\!|\!|_2$ and integrating the tail bound, we obtain



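$$
\begin{aligned}
\mathbb{E}\Big[ \Big|\!\Big|\!\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \widetilde{X}^{(i)} \Big|\!\Big|\!\Big|_2 \Big]
&\le 10 \max\Big\{ \sqrt{\frac{d \log d}{n}}, \; \frac{L d \log d}{n} \Big\} \\
&\le 10 \max\Big\{ \sqrt{\frac{L d \log d}{n}}, \; \frac{L d \log d}{n} \Big\},
\end{aligned}
$$

where the second inequality follows since $L \ge 1$.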
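The tail-integration identity $\mathbb{E}[T] = \int_0^\infty \mathbb{P}[T \ge s] \, ds$ used in this proof holds exactly for any empirical distribution on the non-negative reals, which gives a simple way to check it. In the Python sketch below, the exponential samples are a hypothetical choice of non-negative random variable and the helper name is an assumption for illustration.

```python
import random

def integrate_survival(samples):
    """Exact integral over s >= 0 of the empirical survival function
    s -> (number of samples >= s) / N, for non-negative samples."""
    xs = sorted(samples)
    N = len(xs)
    total, prev = 0.0, 0.0
    for idx, x in enumerate(xs):
        # For s in (prev, x], exactly N - idx samples satisfy sample >= s.
        total += (x - prev) * (N - idx) / N
        prev = x
    return total

rng = random.Random(3)
samples = [rng.expovariate(1.0) for _ in range(10000)]   # non-negative draws

empirical_mean = sum(samples) / len(samples)
tail_integral = integrate_survival(samples)

print(empirical_mean, tail_integral)
```

The two printed values agree up to floating-point rounding, which mirrors how an upper bound on the tail probability converts into an upper bound on the expectation.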



F Ahlswede-Winter matrix bound

Here we state a Bernstein version of the Ahlswede-Winter tail bound [1] for the operator norm of a sum of random matrices. The version here is a slight weakening (but sufficient for our purposes) of a result due to Recht [22]; we also refer the reader to the notes of Vershynin [31], and the strengthened results provided by Tropp [30].

Let $Y^{(i)}$ be independent $d_r \times d_c$ zero-mean random matrices such that $|\!|\!|Y^{(i)}|\!|\!|_2 \le M$, and define

$$
\sigma_i^2 := \max\big\{ |\!|\!| \mathbb{E}[(Y^{(i)})^T Y^{(i)}] |\!|\!|_2, \; |\!|\!| \mathbb{E}[Y^{(i)} (Y^{(i)})^T] |\!|\!|_2 \big\},
$$

as well as $\sigma^2 := \sum_{i=1}^n \sigma_i^2$.
Lemma 7. We have

n t 
P[|^y W |2 >t]< (d r x d c ) max{exp(-t 2 /(4a 2 ), exp(-— )} (62) 

i=i 

As noted by Vershynin [31], the same bound also holds under the assumption that each $Y^{(i)}$ is sub-exponential with parameter $M = \|Y^{(i)}\|_{\psi_1}$. Here we are using the Orlicz norm

$$
\|Z\|_{\psi_1} := \inf\{ t > 0 \mid \mathbb{E}[\psi_1(|Z|/t)] < \infty \},
$$

defined by the function $\psi_1(x) = \exp(x) - 1$, as is appropriate for sub-exponential variables (e.g., see the book [15]).